Preface


Deep learning uses multilayered neural networks trained with large data sets to 
solve complex information processing tasks and has emerged as the most successful 
paradigm in the field of machine learning. Over the last decade, deep learning has 
revolutionized many domains including computer vision, speech recognition, and 
natural language processing, and it is being used in a growing multitude of applica- 
tions across healthcare, manufacturing, commerce, finance, scientific discovery, and 
many other sectors. Recently, massive neural networks, known as large language 
models and comprising of the order of a trillion learnable parameters, have been 
found to exhibit the first indications of general artificial intelligence and are now 
driving one of the biggest disruptions in the history of technology. 


Goals of the book 


This expanding impact has been accompanied by an explosion in the number 
and breadth of research publications in machine learning, and the pace of innova- 
tion continues to accelerate. For newcomers to the field, the challenge of getting 
to grips with the key ideas, let alone catching up to the research frontier, can seem 
daunting. Against this backdrop, Deep Learning: Foundations and Concepts aims 
to provide newcomers to machine learning, as well as those already experienced in 
the field, with a thorough understanding of both the foundational ideas that underpin 
deep learning as well as the key concepts of modern deep learning architectures and 
techniques. This material will equip the reader with a strong basis for future spe- 
cialization. Due to the breadth and pace of change in the field, we have deliberately 
avoided trying to create a comprehensive survey of the latest research. Instead, much 
of the value of the book derives from a distillation of key ideas, and although the field 
itself can be expected to continue its rapid advance, these foundations and concepts 
are likely to stand the test of time. For example, large language models have been 
evolving very rapidly at the time of writing, yet the underlying transformer archi- 
tecture and attention mechanism have remained largely unchanged for the last five 
years, while many core principles of machine learning have been known for decades. 


Responsible use of technology


Deep learning is a powerful technology with broad applicability that has the po- 
tential to create huge value for the world and address some of society’s most pressing 
challenges. However, these same attributes mean that deep learning also has poten- 
tial both for deliberate misuse and to cause unintended harms. We have chosen not 
to discuss ethical or societal aspects of the use of deep learning, as these topics are of 
such importance and complexity that they warrant a more thorough treatment than is 
possible in a technical textbook such as this. Such considerations should, however, 
be informed by a solid grounding in the underlying technology and how it works, 
and so we hope that this book will make a valuable contribution towards these im- 
portant discussions. The reader is, nevertheless, strongly encouraged to be mindful 
about the broader implications of their work and to learn about the responsible use 
of deep learning and artificial intelligence alongside their studies of the technology 
itself. 


Structure of the book 


The book is structured into a relatively large number of smaller bite-sized chap- 
ters, each of which explores a specific topic. The book has a linear structure in 
the sense that each chapter depends only on material covered in earlier chapters. It 
is well suited to teaching a two-semester undergraduate or postgraduate course on 
machine learning but is equally relevant to those engaged in active research or in 
self-study. 

A clear understanding of machine learning can be achieved only through the 
use of some level of mathematics. Specifically, three areas of mathematics lie at the 
heart of machine learning: probability theory, linear algebra, and multivariate cal- 
culus. The book provides a self-contained introduction to the required concepts in 
probability theory and includes an appendix that summarizes some useful results in 
linear algebra. It is assumed that the reader already has some familiarity with the 
basic concepts of multivariate calculus although there are appendices that provide 
introductions to the calculus of variations and to Lagrange multipliers. The focus 
of the book, however, is on conveying a clear understanding of ideas, and the em- 
phasis is on techniques that have real-world practical value rather than on abstract 
theory. Where possible we try to present more complex concepts from multiple com- 
plementary perspectives including textual description, diagrams, and mathematical 
formulae. In addition, many of the key algorithms discussed in the text are summa- 
rized in separate boxes. These do not address issues of computational efficiency, but 
are provided as a complement to the mathematical explanations given in the text. 
We therefore hope that the material in this book will be accessible to readers from a 
variety of backgrounds. 

Conceptually, this book is perhaps most naturally viewed as a successor to Neu- 
ral Networks for Pattern Recognition (Bishop, 1995b), which provided the first com- 
prehensive treatment of neural networks from a statistical perspective. It can also 
be considered as a companion volume to Pattern Recognition and Machine Learn- 
ing (Bishop, 2006), which covered a broader range of topics in machine learning 
although it predated the deep learning revolution. However, to ensure that this 
new book is self-contained, appropriate material has been carried over from Bishop
(2006) and refactored to focus on those foundational ideas that are needed for deep 
learning. This means that there are many interesting topics in machine learning dis- 
cussed in Bishop (2006) that remain of interest today but which have been omitted 
from this new book. For example, Bishop (2006) discusses Bayesian methods in 
some depth, whereas this book is almost entirely non-Bayesian. 

The book is accompanied by a web site that provides supporting material, in- 
cluding a free-to-use digital version of the book as well as solutions to the exercises 
and downloadable versions of the figures in PDF and JPEG formats: 


https://www.bishopbook.com 

The book can be cited using the following BibTeX entry:


@book{Bishop:DeepLearning24,
  author = {Christopher M. Bishop and Hugh Bishop},
  title = {Deep Learning: Foundations and Concepts},
  year = {2024},
  publisher = {Springer}
}


If you have any feedback on the book or would like to report any errors, please 
send these to feedback@bishopbook.com 


References 


In the spirit of focusing on core ideas, we make no attempt to provide a com- 
prehensive literature review, which in any case would be impossible given the scale 
and pace of change of the field. We do, however, provide references to some of the 
key research papers as well as review articles and other sources of further reading. 
In many cases, these also provide important implementation details that we gloss 
over in the text in order not to distract the reader from the central concepts being 
discussed. 

Many books have been written on the subject of machine learning in general and 
on deep learning in particular. Those which are closest in level and style to this book 
include Bishop (2006), Goodfellow, Bengio, and Courville (2016), Murphy (2022), 
Murphy (2023), and Prince (2023). 

Over the last decade, the nature of machine learning scholarship has changed 
significantly, with many papers being posted online on archival sites ahead of, or 
even instead of, submission to peer-reviewed conferences and journals. The most 
popular of these sites is arXiv, pronounced ‘archive’, and is available at 


https://arXiv.org 


The site allows papers to be updated, often leading to multiple versions associated 
with different calendar years, which can result in some ambiguity as to which version 
should be cited and for which year. It also provides free access to a PDF of each pa- 
per. We have therefore adopted a simple approach of referencing the paper according 
to the year of first upload, although we recommend reading the most recent version. 
Papers on arXiv are indexed using a notation arXiv:YYMM.XXXXX where YY and
MM denote the year and month of first upload, respectively. Subsequent versions are
denoted by appending a version number N in the form arXiv:YYMM.XXXXXvN.


Exercises 


Each chapter concludes with a set of exercises designed to reinforce the key 
ideas explained in the text or to develop and generalize them in significant ways. 
These exercises form an important part of the text and each is graded according to 
difficulty ranging from (⋆), which denotes a simple exercise taking a few moments
to complete, through to (⋆⋆⋆), which denotes a significantly more complex exercise.
The reader is strongly encouraged to attempt the exercises since active participation 
with the material greatly increases the effectiveness of learning. Worked solutions to 
all of the exercises are available as a downloadable PDF file from the book web site. 


Mathematical notation 


We follow the same notation as Bishop (2006). For an overview of mathematics 
in the context of machine learning, see Deisenroth, Faisal, and Ong (2020). 

Vectors are denoted by lower case bold roman letters such as $\mathbf{x}$, whereas matrices
are denoted by uppercase bold roman letters, such as $\mathbf{M}$. All vectors are assumed to
be column vectors unless otherwise stated. A superscript $\mathrm{T}$ denotes the transpose of a
matrix or vector, so that $\mathbf{x}^{\mathrm{T}}$ will be a row vector. The notation $(w_1, \ldots, w_M)$ denotes
a row vector with $M$ elements, and the corresponding column vector is written as
$\mathbf{w} = (w_1, \ldots, w_M)^{\mathrm{T}}$. The $M \times M$ identity matrix (also known as the unit matrix)
is denoted $\mathbf{I}_M$, which will be abbreviated to $\mathbf{I}$ if there is no ambiguity about its
dimensionality. It has elements $I_{ij}$ that equal 1 if $i = j$ and 0 if $i \neq j$. The elements
of a unit matrix are sometimes denoted by $\delta_{ij}$. The notation $\mathbf{1}$ denotes a column
vector in which all elements have the value 1. $\mathbf{a} \oplus \mathbf{b}$ denotes the concatenation of
vectors $\mathbf{a}$ and $\mathbf{b}$, so that if $\mathbf{a} = (a_1, \ldots, a_N)$ and $\mathbf{b} = (b_1, \ldots, b_M)$ then
$\mathbf{a} \oplus \mathbf{b} = (a_1, \ldots, a_N, b_1, \ldots, b_M)$. $|x|$ denotes the modulus (the positive part) of a scalar $x$,
also known as the absolute value. We use $\det \mathbf{A}$ to denote the determinant of a matrix
$\mathbf{A}$.

The notation $x \sim p(x)$ signifies that $x$ is sampled from the distribution $p(x)$.
Where there is ambiguity, we will use subscripts as in $p_x(\cdot)$ to denote which density
is referred to. The expectation of a function $f(x, y)$ with respect to a random variable
$x$ is denoted by $\mathbb{E}_x[f(x, y)]$. In situations where there is no ambiguity as to which
variable is being averaged over, this will be simplified by omitting the suffix, for
instance $\mathbb{E}[x]$. If the distribution of $x$ is conditioned on another variable $z$, then
the corresponding conditional expectation will be written $\mathbb{E}_x[f(x)|z]$. Similarly, the
variance of $f(x)$ is denoted $\mathrm{var}[f(x)]$, and for vector variables, the covariance is
written $\mathrm{cov}[\mathbf{x}, \mathbf{y}]$. We will also use $\mathrm{cov}[\mathbf{x}]$ as a shorthand notation for $\mathrm{cov}[\mathbf{x}, \mathbf{x}]$.

The symbol $\forall$ means ‘for all’, so that $\forall m \in \mathcal{M}$ denotes all values of $m$ within
the set $\mathcal{M}$. We use $\mathbb{R}$ to denote the real numbers. On a graph, the set of neighbours of
node $i$ is denoted $\mathcal{N}(i)$, which should not be confused with the Gaussian or normal
distribution $\mathcal{N}(x|\mu, \sigma^2)$. A functional is denoted $f[y]$ where $y(x)$ is some function.
The concept of a functional is discussed in Appendix B. Curly braces { } denote a
set. The notation $g(x) = \mathcal{O}(f(x))$ denotes that $|g(x)/f(x)|$ is bounded as $x \to \infty$.
For instance, if $g(x) = 3x^2 + 2$, then $g(x) = \mathcal{O}(x^2)$. The notation $\lfloor x \rfloor$ denotes the
‘floor’ of $x$, i.e., the largest integer that is less than or equal to $x$.

If we have $N$ independent and identically distributed (i.i.d.) values $\mathbf{x}_1, \ldots, \mathbf{x}_N$
of a $D$-dimensional vector $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$, we can combine the observations
into a data matrix $\mathbf{X}$ of dimension $N \times D$ in which the $n$th row of $\mathbf{X}$ corresponds
to the row vector $\mathbf{x}_n^{\mathrm{T}}$. Thus, the $n, i$ element of $\mathbf{X}$ corresponds to the $i$th element of
the $n$th observation $\mathbf{x}_n$ and is written $x_{ni}$. For one-dimensional variables, we denote
such a matrix by $\mathsf{x}$, which is a column vector whose $n$th element is $x_n$. Note that
$\mathsf{x}$ (which has dimensionality $N$) uses a different typeface to distinguish it from $\mathbf{x}$
(which has dimensionality $D$).
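
As an informal aid, the notational conventions above can be mirrored with array operations. The following is a minimal sketch in Python with NumPy, which is an assumption of this illustration rather than something the notation itself prescribes; all variable names and numerical values are hypothetical.

```python
import numpy as np

# Column vector w = (w_1, ..., w_M)^T stored as an (M, 1) array; w.T is the row vector.
w = np.array([[1.0], [2.0], [3.0]])
M = w.shape[0]

# M x M identity matrix I_M: elements equal 1 if i = j and 0 otherwise.
I_M = np.eye(M)

# Concatenation of two vectors a and b.
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
a_concat_b = np.concatenate([a, b])

# Data matrix X of dimension N x D: the nth row holds the observation x_n^T.
N, D = 4, 2
X = np.arange(N * D, dtype=float).reshape(N, D)
x_2 = X[1]        # second observation (NumPy indexes from 0)
x_21 = X[1, 0]    # element x_{ni} with n = 2, i = 1

# Floor of a scalar: the largest integer less than or equal to it.
floor_value = np.floor(2.7)
```

Note that the text indexes observations and elements from 1, whereas NumPy indexes from 0, so $x_{ni}$ corresponds to `X[n - 1, i - 1]`.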


Machine learning today is one of the most important, and fastest growing, fields
of technology. Applications of machine learning are becoming ubiquitous, and so- 
lutions learned from data are increasingly displacing traditional hand-crafted algo- 
rithms. This has not only led to improved performance for existing technologies but 
has opened the door to a vast range of new capabilities that would be inconceivable 
if new algorithms had to be designed explicitly by hand. 

One particular branch of machine learning, known as deep learning, has emerged 
as an exceptionally powerful and general-purpose framework for learning from data. 
Deep learning is based on computational models called neural networks which were 
originally inspired by mechanisms of learning and information processing in the hu- 
man brain. The field of artificial intelligence, or AI, seeks to recreate the powerful 
capabilities of the brain in machines, and today the terms machine learning and AI 
are often used interchangeably. Many of the AI systems in current use represent
applications of machine learning which are designed to solve very specific and focused
problems, and while these are extremely useful they fall far short of the tremendous 
breadth of capabilities of the human brain. This has led to the introduction of the 
term artificial general intelligence, or AGI, to describe the aspiration of building 
machines with this much greater flexibility. After many decades of steady progress, 
machine learning has now entered a phase of very rapid development. Recently, 
massive deep learning systems called large language models have started to exhibit 
remarkable capabilities that have been described as the first indications of artificial 
general intelligence (Bubeck et al., 2023). 


The Impact of Deep Learning 


We begin our discussion of machine learning by considering four examples drawn 
from diverse fields to illustrate the huge breadth of applicability of this technology 
and to introduce some basic concepts and terminology. What is particularly remark- 
able about these and many other examples is that they have all been addressed using 
variants of the same fundamental framework of deep learning. This is in sharp con- 
trast to conventional approaches in which different applications are tackled using 
widely differing and specialist techniques. It should be emphasized that the exam- 
ples we have chosen represent only a tiny fraction of the breadth of applicability for 
deep neural networks and that almost every domain where computation has a role is 
amenable to the transformational impact of deep learning. 


1.1.1 Medical diagnosis 


Consider first the application of machine learning to the problem of diagnosing 
skin cancer. Melanoma is the most dangerous kind of skin cancer but is curable 
if detected early. Figure 1.1 shows example images of skin lesions, with malig- 
nant melanomas on the top row and benign nevi on the bottom row. Distinguishing 
between these two classes of image is clearly very challenging, and it would be vir- 
tually impossible to write an algorithm by hand that could successfully classify such 
images with any reasonable level of accuracy. 

Figure 1.1 Examples of skin lesions corresponding to dangerous malignant melanomas on the top
row and benign nevi on the bottom row. It is difficult for the untrained eye to distinguish
between these two classes.

This problem has been successfully addressed using deep learning (Esteva et
al., 2017). The solution was created using a large set of lesion images, known as
a training set, each of which is labelled as either malignant or benign, where the
labels are obtained from a biopsy test that is considered to provide the true class 
of the lesion. The training set is used to determine the values of some 25 million 
adjustable parameters, known as weights, in a deep neural network. This process of 
setting the parameter values from data is known as learning or training. The goal 
is for the trained network to predict the correct label for a new lesion just from the 
image alone without needing the time-consuming step of taking a biopsy. This is an 
example of a supervised learning problem because, for each training example, the 
network is told the correct label. It is also an example of a classification problem 
because each input must be assigned to a discrete set of classes (benign or malignant 
in this case). Applications in which the output consists of one or more continuous 
variables are called regression problems. An example of a regression problem would 
be the prediction of the yield in a chemical manufacturing process in which the inputs 
consist of the temperature, the pressure, and the concentrations of reactants. 

An interesting aspect of this application is that the number of labelled training 
images available, roughly 129,000, is considered relatively small, and so the deep 
neural network was first trained on a much larger data set of 1.28 million images of 
everyday objects (such as dogs, buildings, and mushrooms) and then fine-tuned on 
the data set of lesion images. This is an example of transfer learning in which the 
network learns the general properties of natural images from the large data set of 
everyday objects and is then specialized to the specific problem of lesion classifica- 
tion. Through the use of deep learning, the classification of skin lesion images has 
reached a level of accuracy that exceeds that of professional dermatologists (Brinker 
et al., 2019). 


1.1.2 Protein structure 


Proteins are sometimes called the building blocks of living organisms. They are 
biological molecules that consist of one or more long chains of units called amino 
acids, of which there are 22 different types, and the protein is specified by the se- 
quence of amino acids. Once a protein has been synthesized inside a living cell, it 
folds into a complex three-dimensional structure whose behaviour and interactions 
are strongly determined by its shape. Calculating this 3D structure, given the amino 
acid sequence, has been a fundamental open problem in biology for half a century 
that had seen relatively little progress until the advent of deep learning. 

The 3D structure can be measured experimentally using techniques such as X- 
ray crystallography, cryogenic electron microscopy, or nuclear magnetic resonance 
spectroscopy. However, this can be extremely time-consuming and for some pro- 
teins can prove to be challenging, for example due to the difficulty of obtaining a 
pure sample or because the structure is dependent on the context. In contrast, the 
amino acid sequence of a protein can be determined experimentally at lower cost 
and higher throughput. Consequently, there is considerable interest in being able 
to predict the 3D structures of proteins directly from their amino acid sequences in 
order to better understand biological processes or for practical applications such as 
drug discovery. A deep learning model can be trained to take an amino acid sequence
as input and generate the 3D structure as output, in which the training data
consist of a set of proteins for which the amino acid sequence and the 3D structure
are both known. Protein structure prediction is therefore another example of
supervised learning. Once the system is trained it can take a new amino acid sequence as
input and can predict the associated 3D structure (Jumper et al., 2021). Figure 1.2
compares the predicted 3D structure of a protein and the ground truth obtained by
X-ray crystallography.

Figure 1.2 Illustration of the 3D shape of a protein called T1044/6VR4. The green
structure shows the ground truth as determined by X-ray crystallography, whereas the
superimposed blue structure shows the prediction obtained by a deep learning model
called AlphaFold. [From Jumper et al. (2021) with permission.]


1.1.3 Image synthesis 


In the two applications discussed so far, a neural network learned to transform 
an input (a skin image or an amino acid sequence) into an output (a lesion classifica- 
tion or a 3D protein structure, respectively). We turn now to an example where the 
training data consist simply of a set of sample images and the goal of the trained net- 
work is to create new images of the same kind. This is an example of unsupervised 
learning because the images are unlabelled, in contrast to the lesion classification 
and protein structure examples. Figure 1.3 shows examples of synthetic images gen- 
erated by a deep neural network trained on a set of images of human faces taken in a 
studio against a plain background. Such synthetic images are of exceptionally high 
quality and it can be difficult to tell them apart from photographs of real people.

This is an example of a generative model because it can generate new output 
examples that differ from those used to train the model but which share the same 
statistical properties. A variant of this approach allows images to be generated that
depend on an input text string, known as a prompt, so that the image content reflects
the semantics of the text input. The term generative AI is used to describe deep learn- 
ing models that generate outputs in the form of images, video, audio, text, candidate 
drug molecules, or other modalities. 

Figure 1.3 Synthetic face images generated by a deep neural network trained using unsupervised learning.
[From https://generated.photos.] 


1.1.4 Large language models 


One of the most important advances in machine learning in recent years has been
the development of powerful models for processing natural language and other forms 
of sequential data such as source code. A large language model, or LLM, uses deep 
learning to build rich internal representations that capture the semantic properties 
of language. An important class of large language models, called autoregressive 
language models, can generate language as output, and therefore, they are a form of 
generative AI. Such models take a sequence of words as input and generate as output
a single word that represents the next word in the sequence. The augmented
sequence, with the new word appended at the end, can then be fed through the model 
again to generate the subsequent word, and this process can be repeated to generate 
a long sequence of words. Such models can also output a special ‘stop’ word that 
signals the end of text generation, thereby allowing them to output text of finite 
length and then halt. At that point, a user could append their own series of words 
to the sequence before feeding the complete sequence back through the model to 
trigger further word generation. In this way, it is possible for a human to have a 
conversation with the neural network. 
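
The generation loop described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: `toy_next_word` is a hypothetical lookup table standing in for a trained network, and the `<stop>` token name is invented for this example. A real autoregressive model conditions on the whole sequence and outputs a probability distribution over possible next words.

```python
# Toy stand-in for a trained language model: maps the last word to a next word.
# This lookup table is purely illustrative and is not how an LLM works internally.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<stop>",
}
STOP = "<stop>"


def toy_next_word(sequence):
    """Return the next word given the sequence so far (hypothetical model)."""
    return NEXT_WORD.get(sequence[-1], STOP)


def generate(prompt, max_new_words=20):
    """Autoregressive generation: append one word at a time until 'stop'."""
    sequence = list(prompt)
    for _ in range(max_new_words):
        word = toy_next_word(sequence)
        if word == STOP:
            break
        sequence.append(word)  # the augmented sequence is fed back in
    return sequence


reply = generate(["the"])
# A user can append further words and resume generation from the longer sequence.
continued = generate(reply + ["the"])
```

The `max_new_words` limit is a practical safeguard added for this sketch, so that the loop halts even if the stand-in model never emits the stop word.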

Such models can be trained on large data sets of text by extracting training pairs 
each consisting of a randomly selected sequence of words as input with the known 
next word as the target output. This is an example of self-supervised learning in 
which a function from inputs to outputs is learned but where the labelled outputs are 
obtained automatically from the input training data without needing separate
human-derived labels. Since large volumes of text are available from multiple sources, this
approach allows for scaling to very large training sets and associated very large
neural networks.

Large language models can exhibit extraordinary capabilities that have been
described as the first indications of emerging artificial general intelligence (Bubeck et
al., 2023), and we discuss such models at length later in the book. Below,
we give an illustration of language generation, based on a model called GPT-4
(OpenAI, 2023), in response to an input prompt ‘Write a proof of the fact that there are
infinitely many primes; do it in the style of a Shakespeare play through a dialogue
between two parties arguing over the proof.’


GPT-4: Proof of infinitely many primes

FIRST PARTY:
Thou dost not understand, my friend,
There simply is no upper end
To primes, those numbers indivisible,
No matter if it seems impossible.

SECOND PARTY:
But how dost thou so surely know?
To prove this truth, what canst thou show?

FIRST PARTY:
Consider this, my doubtful peer,
A clever proof that will make clear:
Assume that there’s a final prime,
The largest one we’ll see in time.
Let’s call this number "P" for short,
And then, my friend, we will purport
To build a number "Q" that’s new,
One larger than "P" and prime too.

SECOND PARTY:
But how, my friend, will this be done?
The proof is not yet fairly won.

FIRST PARTY:
It’s simple, friend, just wait and see,
We’ll multiply the primes, that’s key.
All primes from two to "P" combined,
Will make this new "Q" we will find.
But one thing more we must include,
To make the proof more watertight,
We’ll add a one onto the end,
And now the proof is in our sight.

SECOND PARTY:
I see it now, it’s crystal clear,
For "Q" must be a prime, my dear.
It can’t divide by any prime,
All smaller than itself in time.
And since we chose "P" as the last,
This proves that our assumption’s past.
There is no final prime, we see,
The numbers go on endlessly.


A Tutorial Example


For the newcomer to the field of machine learning, many of the basic concepts and
much of the terminology can be introduced in the context of a simple example
involving the fitting of a polynomial to a small synthetic data set (Bishop, 2006). This
is a form of supervised learning problem in which we would like to make a prediction
for a target variable, given the value of an input variable.


1.2.1 Synthetic data


We denote the input variable by $x$ and the target variable by $t$, and we assume
that both variables take continuous values on the real axis. Suppose that we are
given a training set comprising $N$ observations of $x$, written $x_1, \ldots, x_N$, together
with corresponding observations of the values of $t$, denoted $t_1, \ldots, t_N$. Our goal
is to predict the value of $t$ for some new value of $x$. The ability to make accurate
predictions on previously unseen inputs is a key goal in machine learning and is
known as generalization.

We can illustrate this using a synthetic data set generated by sampling from a
sinusoidal function. Figure 1.4 shows a plot of a training set comprising $N = 10$ data
points in which the input values were generated by choosing values of $x_n$, for
$n = 1, \ldots, N$, spaced uniformly in the range $[0, 1]$. The associated target data values were
obtained by first computing the values of the function $\sin(2\pi x)$ for each value of $x$
and then adding a small level of random noise (governed by a Gaussian distribution)
to each such point to obtain the corresponding target value $t_n$. By generating data
in this way, we are capturing an important property of many real-world data sets,
namely that they possess an underlying regularity, which we wish to learn, but that
individual observations are corrupted by random noise. This noise might arise from
intrinsically stochastic (i.e., random) processes such as radioactive decay but more
typically is due to there being sources of variability that are themselves unobserved.

Figure 1.4 Plot of a training set of $N = 10$ points, shown as blue circles, each comprising
an observation of the input variable $x$ along with the corresponding target variable $t$.
The green curve shows the function $\sin(2\pi x)$ used to generate the data. Our goal is to
predict the value of $t$ for some new value of $x$, without knowledge of the green curve.

In this tutorial example we know the true process that generated the data, namely 
the sinusoidal function. In a practical application of machine learning, our goal is to 
discover the underlying trends in the data given the finite training set. Knowing the 
process that generated the data, however, allows us to illustrate important concepts 
in machine learning. 
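
A data set of this kind can be generated along the following lines. This is a minimal sketch in Python with NumPy; the function name, the random seed, and in particular the noise standard deviation of 0.3 are assumptions made for illustration, since the text specifies only a "small level" of Gaussian noise.

```python
import numpy as np


def make_sinusoid_data(n_points=10, noise_std=0.3, seed=0):
    """Return inputs spaced uniformly in [0, 1] and noisy targets sin(2*pi*x)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    # Underlying regularity plus Gaussian noise on each observation.
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)
    return x, t


x_train, t_train = make_sinusoid_data()
```

Fixing the seed makes the synthetic data reproducible, which is convenient when comparing different models on the same training set.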


1.2.2 Linear models 


Our goal is to exploit this training set to predict the value $\hat{t}$ of the target variable
for some new value $\hat{x}$ of the input variable. As we will see later, this involves
implicitly trying to discover the underlying function $\sin(2\pi x)$. This is intrinsically a
difficult problem as we have to generalize from a finite data set to an entire function.
Furthermore, the observed data is corrupted with noise, and so for a given $\hat{x}$ there
is uncertainty as to the appropriate value for $\hat{t}$. Probability theory provides a
framework for expressing such uncertainty in a precise and quantitative manner, whereas
decision theory allows us to exploit this probabilistic representation to make predic- 
tions that are optimal according to appropriate criteria. Learning probabilities from 
data lies at the heart of machine learning and will be explored in great detail in this 
book. 

To start with, however, we will proceed rather informally and consider a simple 
approach based on curve fitting. In particular, we will fit the data using a polynomial 
function of the form 


$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j \qquad (1.1)$$


where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$.
The polynomial coefficients $w_0, \ldots, w_M$ are collectively denoted by the vector $\mathbf{w}$.
Note that, although the polynomial function y(x, w) is a nonlinear function of x, it 
is a linear function of the coefficients w. Functions, such as this polynomial, that are 
linear in the unknown parameters have important properties, as well as significant 
limitations, and are called linear models. 
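
The distinction between linearity in the coefficients and nonlinearity in the input can be checked numerically. The sketch below, in Python with NumPy, evaluates the polynomial (1.1); the function name and the example coefficient values are hypothetical choices for illustration.

```python
import numpy as np


def polynomial(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j for a scalar or array x."""
    x = np.asarray(x, dtype=float)
    powers = np.arange(len(w))  # j = 0, ..., M
    return (x[..., None] ** powers) @ np.asarray(w, dtype=float)


w_a = np.array([1.0, -2.0, 0.5])  # coefficients of an order M = 2 polynomial
w_b = np.array([0.0, 1.0, 3.0])
x = np.array([0.0, 0.5, 2.0])

# Linear in w: y(x, 2*w_a + 3*w_b) equals 2*y(x, w_a) + 3*y(x, w_b).
lhs = polynomial(x, 2 * w_a + 3 * w_b)
rhs = 2 * polynomial(x, w_a) + 3 * polynomial(x, w_b)

# Nonlinear in x: scaling the input does not simply scale the output.
scaled_input = polynomial(2 * x, w_a)
scaled_output = 2 * polynomial(x, w_a)
```

Here `lhs` and `rhs` agree up to floating-point rounding, whereas `scaled_input` and `scaled_output` differ, which is the sense in which the model is linear in $\mathbf{w}$ but not in $x$.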


1.2.3 Error function 


The values of the coefficients will be determined by fitting the polynomial to the 
training data. This can be done by minimizing an error function that measures the 
misfit between the function y(x, w), for any given value of w, and the training set 
data points. One simple choice of error function, which is widely used, is the sum of 
the squares of the differences between the predictions $y(x_n, \mathbf{w})$ for each data point
$x_n$ and the corresponding target value $t_n$, given by

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 \qquad (1.2)$$

where the factor of 1/2 is included for later convenience. We will later derive this
error function starting from probability theory. Here we simply note that it is a
non-negative quantity that would be zero if, and only if, the function $y(x, \mathbf{w})$ were to
pass exactly through each training data point. The geometrical interpretation of the
sum-of-squares error function is illustrated in Figure 1.5.

Figure 1.5 The error function (1.2) corresponds to (one half of) the sum of the squares of
the displacements (shown by the vertical green arrows) of each data point from the
function $y(x, \mathbf{w})$.

We can solve the curve fitting problem by choosing the value of w for which 
E(w) is as small as possible. Because the error function is a quadratic function of 
the coefficients w, its derivatives with respect to the coefficients will be linear in the 
elements of w, and so the minimization of the error function has a unique solution, 
denoted by w*, which can be found in closed form. The resulting polynomial is 
given by the function y(x, w*). 
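
The closed-form fit described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the procedure used to produce the figures: the function names, the random seed, and the noise level of 0.3 are our own assumptions, and we rely on `numpy.linalg.lstsq` to solve the linear least-squares problem that arises when the derivatives of (1.2) are set to zero.

```python
import numpy as np

def design_matrix(x, M):
    """Return the N x (M+1) matrix whose column j holds x**j."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Minimize the sum-of-squares error (1.2) and return w*."""
    Phi = design_matrix(x, M)
    # lstsq solves the least-squares problem without explicitly
    # forming the normal equations, which is numerically safer.
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_star

def sum_of_squares_error(w, x, t):
    """E(w) from (1.2), including the factor of 1/2."""
    residual = design_matrix(x, len(w) - 1) @ w - t
    return 0.5 * np.sum(residual ** 2)

# Hypothetical data: N = 10 evenly spaced inputs, with targets given by
# sin(2*pi*x) plus Gaussian noise (noise level 0.3 is an assumption).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)

w_star = fit_polynomial(x_train, t_train, M=3)
print("w* =", w_star)
print("E(w*) =", sum_of_squares_error(w_star, x_train, t_train))
```

Because the model is linear in w, no iterative search is needed here; the same code works for any order M by changing a single argument.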


1.2.4 Model complexity 


There remains the problem of choosing the order M of the polynomial, and as 
we will see this will turn out to be an example of an important concept called model 
comparison or model selection. In Figure 1.6, we show four examples of the results 
of fitting polynomials having orders M = 0,1,3, and 9 to the data set shown in 
Figure 1.4. 

Notice that the constant (M = 0) and first-order (M = 1) polynomials give poor
fits to the data and consequently poor representations of the function sin(2πx). The
third-order (M = 3) polynomial seems to give the best fit to the function sin(2πx) of
the examples shown in Figure 1.6. When we go to a much higher order polynomial
(M = 9), we obtain an excellent fit to the training data. In fact, the polynomial
passes exactly through each data point and E(w*) = 0. However, the fitted curve
oscillates wildly and gives a very poor representation of the function sin(2πx). This
latter behaviour is known as over-fitting.

Figure 1.6 Plots of polynomials having various orders M, shown as red curves, fitted to the data set shown in
Figure 1.4 by minimizing the error function (1.2).

Our goal is to achieve good generalization by making accurate predictions for 
new data. We can obtain some quantitative insight into the dependence of the
generalization performance on M by considering a separate set of data known as a test set,
comprising 100 data points generated using the same procedure as used to generate
the training set points. For each value of M, we can evaluate the residual value of
E(w*) given by (1.2) for the training data, and we can also evaluate E(w*) for the
test data set. Instead of evaluating the error function E(w), it is sometimes more 
convenient to use the root-mean-square (RMS) error defined by 


$$
E_{\mathrm{RMS}} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 } \tag{1.3}
$$


in which the division by N allows us to compare different sizes of data sets on an
equal footing, and the square root ensures that E_RMS is measured on the same scale
(and in the same units) as the target variable t. Graphs of the training-set and test-set
RMS errors are shown, for various values of M, in Figure 1.7. The test set error
is a measure of how well we are doing in predicting the values of t for new data
observations of x. Note from Figure 1.7 that small values of M give relatively large
values of the test set error, and this can be attributed to the fact that the corresponding
polynomials are rather inflexible and are incapable of capturing the oscillations in
the function sin(2πx). Values of M in the range 3 ≤ M ≤ 8 give small values
for the test set error, and these also give reasonable representations of the generating
function sin(2πx), as can be seen for M = 3 in Figure 1.6.

Figure 1.7 Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set, and on an independent test set, for various values of M. [Plot of E_RMS against M with Training and Test curves.]

For M = 9, the training set error goes to zero, as we might expect because 
this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients 
w_0, ..., w_9, and so can be tuned exactly to the 10 data points in the training set.
However, the test set error has become very large and, as we saw in Figure 1.6, the 
corresponding function y(x, w*) exhibits wild oscillations. 
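
The comparison between training and test error can be reproduced qualitatively with a short experiment. The sketch below is an illustration under our own assumptions (evenly spaced inputs, Gaussian noise of standard deviation 0.3, a fixed random seed, and hypothetical helper names), so the numbers it prints will not match Figure 1.7 exactly, although the training error for M = 9 should be zero to within rounding.

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS from (1.3): square root of the mean squared residual."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

def make_data(n, rng, noise_std=0.3):
    """Hypothetical generator: sin(2*pi*x) plus Gaussian noise."""
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise_std, size=n)

rng = np.random.default_rng(1)
x_train, t_train = make_data(10, rng)    # N = 10 training points
x_test, t_test = make_data(100, rng)     # 100 independent test points

train_rms, test_rms = {}, {}
for M in range(10):
    # Least-squares fit of an order-M polynomial to the training set only
    Phi = np.vander(x_train, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    train_rms[M] = rms_error(w_star, x_train, t_train)
    test_rms[M] = rms_error(w_star, x_test, t_test)
    print(f"M={M}  train={train_rms[M]:.3f}  test={test_rms[M]:.3f}")
```

Since each polynomial contains all lower-order polynomials as special cases, the training error can only decrease as M grows, whereas the test error is under no such constraint.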

This may seem paradoxical because a polynomial of a given order contains all 
lower-order polynomials as special cases. The M = 9 polynomial is therefore ca- 
pable of generating results at least as good as the M = 3 polynomial. Furthermore, 
we might suppose that the best predictor of new data would be the function sin(2πx)
from which the data was generated (and we will see later that this is indeed the case).
We know that a power series expansion of the function sin(2πx) contains terms of all
orders, so we might expect that results should improve monotonically as we increase 
M. 

We can gain some insight into the problem by examining the values of the co- 
efficients w* obtained from polynomials of various orders, as shown in Table 1.1. 
We see that, as M increases, the magnitude of the coefficients typically gets larger. 
In particular for the M = 9 polynomial, the coefficients have become finely tuned 
to the data. They have large positive and negative values so that the corresponding 
polynomial function matches each of the data points exactly, but between data points 
(particularly near the ends of the range) the function exhibits the large oscillations 
observed in Figure 1.6. Intuitively, what is happening is that the more flexible poly- 
nomials with larger values of M are increasingly tuned to the random noise on the 
target values. 

Further insight into this phenomenon can be gained by examining the behaviour 
of the learned model as the size of the data set is varied, as shown in Figure 1.8. We 
see that, for a given model complexity, the over-fitting problem becomes less severe
as the size of the data set increases. Another way to say this is that with a larger
data set, we can afford to fit a more complex (in other words more flexible) model
to the data. One rough heuristic that is sometimes advocated in classical statistics
is that the number of data points should be no less than some multiple (say 5 or
10) of the number of learnable parameters in the model. However, when we discuss
deep learning later in this book, we will see that excellent results can be obtained
using models that have significantly more parameters than the number of training
data points.

Figure 1.8 Plots of the solutions obtained by minimizing the sum-of-squares error function (1.2) using the
M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing
the size of the data set reduces the over-fitting problem.


1.2.5 Regularization 


Table 1.1 Table of the coefficients w* for polynomials of various order. Observe how the typical magnitude of the coefficients increases dramatically as the order of the polynomial increases.

          M = 0     M = 1     M = 3          M = 9
w_0*       0.11      0.90      0.12           0.26
w_1*                −1.58     11.20         −66.13
w_2*                         −33.67       1,665.69
w_3*                          22.43     −15,566.61
w_4*                                     76,321.23
w_5*                                   −217,389.15
w_6*                                    370,626.48
w_7*                                   −372,051.47
w_8*                                    202,540.70
w_9*                                    −46,080.94

There is something rather unsatisfying about having to limit the number of
parameters in a model according to the size of the available training set. It would seem
more reasonable to choose the complexity of the model according to the complexity
of the problem being solved. One technique that is often used to control the
over-fitting phenomenon, as an alternative to limiting the number of parameters, is that
of regularization, which involves adding a penalty term to the error function (1.2) to
discourage the coefficients from having large magnitudes. The simplest such penalty
term takes the form of the sum of the squares of all of the coefficients, leading to a
modified error function of the form

$$
\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 \tag{1.4}
$$

where ‖w‖² = wᵀw = w_0² + w_1² + ... + w_M², and the coefficient λ governs the
relative importance of the regularization term compared with the sum-of-squares error
term. Note that often the coefficient w_0 is omitted from the regularizer because its
inclusion causes the results to depend on the choice of origin for the target variable
(Hastie, Tibshirani, and Friedman, 2009), or it may be included but with its own
regularization coefficient. Again, the error function in (1.4) can be minimized
exactly in closed form. Techniques such as this are known in the statistics literature as
shrinkage methods because they reduce the value of the coefficients. In the context
of neural networks, this approach is known as weight decay because the parameters
in a neural network are called weights and this regularizer encourages them to decay
towards zero.

Figure 1.9 Plots of M = 9 polynomials fitted to the data set shown in Figure 1.4 using the regularized error
function (1.4) for two values of the regularization parameter λ corresponding to ln λ = −18 and ln λ = 0. The
case of no regularizer, i.e., λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.6.
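
To make the closed-form minimization of (1.4) concrete, note that setting its gradient with respect to w to zero gives the linear system (ΦᵀΦ + λI) w = Φᵀt, where Φ is the matrix of powers of the inputs (the symbol Φ and the function names below are our own notation, introduced only for this sketch). The data generation, noise level, and seed are likewise assumptions, and w_0 is included in the penalty exactly as written in (1.4).

```python
import numpy as np

def fit_regularized(x, t, M, lam):
    """Minimize (1.4) by solving (Phi^T Phi + lam * I) w = Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)   # column j holds x**j
    A = Phi.T @ Phi + lam * np.eye(M + 1)        # penalty also covers w_0
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical data: sin(2*pi*x) plus Gaussian noise at 10 inputs
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)

for ln_lam in (-18.0, 0.0):
    w = fit_regularized(x_train, t_train, M=9, lam=np.exp(ln_lam))
    print(f"ln lambda = {ln_lam:6.1f}: largest |w_j| = {np.max(np.abs(w)):.2f}")
```

Omitting w_0 from the regularizer, as mentioned in the text, would amount to zeroing the first diagonal entry of the added λI term.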

Figure 1.9 shows the results of fitting the polynomial of order M = 9 to the 
same data set as before but now using the regularized error function given by (1.4). 
We see that, for a value of ln λ = −18, the over-fitting has been suppressed and we
now obtain a much closer representation of the underlying function sin(2πx). If,
however, we use too large a value for λ then we again obtain a poor fit, as shown in
Figure 1.9 for ln λ = 0. The corresponding coefficients from the fitted polynomials
are given in Table 1.2, showing that regularization has the desired effect of reducing
the magnitude of the coefficients.

The impact of the regularization term on the generalization error can be seen by
plotting the value of the RMS error (1.3) for both training and test sets against ln λ,
as shown in Figure 1.10. We see that λ now controls the effective complexity of the
model and hence determines the degree of over-fitting.

Figure 1.10 Graph of the root-mean-square error (1.3) versus ln λ for the M = 9 polynomial. [Plot with Training and Test curves.]


1.2.6 Model selection 


The quantity λ is an example of a hyperparameter whose values are fixed during
the minimization of the error function to determine the model parameters w. We
cannot simply determine the value of λ by minimizing the error function jointly with
respect to w and λ since this will lead to λ → 0 and an over-fitted model with small
or zero training error. Similarly, the order M of the polynomial is a hyperparameter 
of the model, and simply optimizing the training set error with respect to M will 
lead to large values of M and associated over-fitting. We therefore need to find a 
way to determine suitable values for hyperparameters. The results above suggest a 
simple way of achieving this, namely by taking the available data and partitioning it 
into a training set, used to determine the coefficients w, and a separate validation set, 
also called a hold-out set or a development set. We then select the model having the 
lowest error on the validation set. If the model design is iterated many times using a 
data set of limited size, then some over-fitting to the validation data can occur, and 
so it may be necessary to keep aside a third test set on which the performance of the 
selected model can finally be evaluated. 

Table 1.2 Table of the coefficients w* for M = 9 polynomials with various values for the regularization parameter λ. Note that ln λ = −∞ corresponds to a model with no regularization, i.e., to the graph at the bottom right in Figure 1.6. We see that, as the value of λ increases, the magnitude of a typical coefficient gets smaller.

          ln λ = −∞    ln λ = −18    ln λ = 0
w_0*           0.26          0.26        0.11
w_1*         −66.13          0.64       −0.07
w_2*       1,665.69         43.68       −0.09
w_3*     −15,566.61       −144.00       −0.07
w_4*      76,321.23         57.90       −0.05
w_5*    −217,389.15        117.36       −0.04
w_6*     370,626.48          9.87       −0.02
w_7*    −372,051.47        −90.02       −0.01
w_8*     202,540.70        −70.90       −0.01
w_9*     −46,080.94         75.26        0.00

For some applications, the supply of data for training and testing will be limited.
To build a good model, we should use as much of the available data as possible for
training. However, if the validation set is too small, it will give a relatively noisy
estimate of predictive performance. One solution to this dilemma is to use
cross-validation, which is illustrated in Figure 1.11. This allows a proportion (S − 1)/S of
the available data to be used for training while making use of all of the data to assess
performance. When data is particularly scarce, it may be appropriate to consider the
case S = N, where N is the total number of data points, which gives the
leave-one-out technique.

Figure 1.11 The technique of S-fold cross-validation, illustrated here for the case of S = 4, involves taking the available data and partitioning it into S groups of equal size. Then S − 1 of the groups are used to train a set of models that are then evaluated on the remaining group. This procedure is then repeated for all S possible choices for the held-out group, indicated here by the red blocks, and the performance scores from the S runs are then averaged. [Diagram rows: run 1, run 2, run 3, run 4.]
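
As an illustration of S-fold cross-validation applied to the polynomial example, the following sketch scores a few candidate values of λ and keeps the one with the lowest average held-out RMS error. The function name, the data set of 20 noisy points, and the candidate values of ln λ are hypothetical choices made for this example. When N is not divisible by S, `numpy.array_split` produces groups of nearly equal rather than exactly equal size.

```python
import numpy as np

def cross_validation_rms(x, t, M, lam, S, rng):
    """Average held-out RMS error over S folds for one (M, lam) setting."""
    folds = np.array_split(rng.permutation(len(x)), S)
    scores = []
    for held_out in folds:
        # Train on the S - 1 groups that are not held out
        train = np.setdiff1d(np.arange(len(x)), held_out)
        Phi = np.vander(x[train], M + 1, increasing=True)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t[train])
        # Evaluate on the remaining group
        y = np.vander(x[held_out], M + 1, increasing=True) @ w
        scores.append(np.sqrt(np.mean((y - t[held_out]) ** 2)))
    return float(np.mean(scores))

# Hypothetical data set of N = 20 noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=20)

candidates = [-18.0, -9.0, 0.0]   # assumed grid of ln(lambda) values
scores = {ln_lam: cross_validation_rms(x, t, M=9, lam=np.exp(ln_lam), S=4,
                                       rng=np.random.default_rng(1))
          for ln_lam in candidates}
best_ln_lam = min(scores, key=scores.get)
print(scores, "selected ln lambda =", best_ln_lam)
```

Each candidate is given a generator with the same seed so that all settings are compared on identical folds. Setting S equal to the number of data points gives the leave-one-out case.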

The main drawback of cross-validation is that the number of training runs that 
must be performed is increased by a factor of S, and this can prove problematic for 
models in which the training is itself computationally expensive. A further problem 
with techniques such as cross-validation that use separate data to assess performance 
is that we might have multiple complexity hyperparameters for a single model (for 
instance, there might be several regularization hyperparameters). Exploring combi- 
nations of settings for such hyperparameters could, in the worst case, require a num- 
ber of training runs that is exponential in the number of hyperparameters. The state 
of the art in modern machine learning involves extremely large models, trained on 
commensurately large data sets. Consequently, there is limited scope for exploration 
of hyperparameter settings, and heavy reliance is placed on experience obtained with 
smaller models and on heuristics. 

This simple example of fitting a polynomial to a synthetic data set generated 
from a sinusoidal function has illustrated many key ideas from machine learning, 
and we will make further use of this example in future chapters. However, real- 
world applications of machine learning differ in several important respects. The size 
of the data sets used for training can be many orders of magnitude larger, and there 
will generally be many more input variables, perhaps numbering in the millions for 
image analysis, for example, as well as multiple output variables. The learnable 
function that relates outputs to inputs is governed by a class of models known as 
neural networks, and these may have a large number of parameters perhaps num- 
bering in the hundreds of billions, and the error function will be a highly nonlinear 
function of those parameters. The error function can no longer be minimized through 
a closed-form solution and instead must be minimized through iterative optimization 
techniques based on evaluation of the derivatives of the error function with respect 
to the parameters, all of which may require specialist computational hardware and 
incur substantial computational cost.

Figure 1.12 Schematic illustration showing two neurons from the human brain. These electrically active cells communicate through junctions called synapses whose strengths change as the network learns. [Labelled parts: dendrites, synapse, cell body.]

1.3. A Brief History of Machine Learning 


Machine learning has a long and rich history, including the pursuit of multiple al- 
ternative approaches. Here we focus on the evolution of machine learning methods 
based on neural networks as these represent the foundation of deep learning and 
have proven to be the most effective approach to machine learning for real-world 
applications. 

Neural network models were originally inspired by studies of information pro- 
cessing in the brains of humans and other mammals. The basic processing units in 
the brain are electrically active cells called neurons, as illustrated in Figure 1.12. 
When a neuron ‘fires’, it sends an electrical impulse down the axon where it reaches 
junctions, called synapses, which form connections with other neurons. Chemical 
signals called neurotransmitters are released at the synapses, and these can stimu- 
late, or inhibit, the firing of subsequent neurons. 

A human brain contains around 90 billion neurons in total, each of which has on 
average several thousand synapses with other neurons, creating a complex network 
having a total of around 100 trillion (10^14) synapses. If a particular neuron receives
sufficient stimulation from the firing of other neurons then it too can be induced to 
fire. However, some synapses have a negative, or inhibitory, effect whereby the firing 
of the input neuron makes it less likely that the output neuron will fire. The extent to 
which one neuron can cause another to fire depends on the strength of the synapse, 
and it is changes in these strengths that represent a key mechanism whereby the
brain can store information and learn from experience. 

These properties of neurons have been captured in very simple mathematical
models, known as artificial neural networks, which then form the basis for
computational approaches to learning (McCulloch and Pitts, 1943). Many of these models
describe the properties of a single neuron by forming a linear combination of the
outputs of other neurons, which is then transformed using a nonlinear function. This
can be expressed mathematically in the form

$$
a = \sum_{i=1}^{M} w_i x_i \tag{1.5}
$$
$$
y = f(a) \tag{1.6}
$$

where x_1, ..., x_M represent M inputs corresponding to the activities of other
neurons that send connections to this neuron, and w_1, ..., w_M are continuous variables,
called weights, which represent the strengths of the associated synapses. The
quantity a is called the pre-activation, the nonlinear function f(·) is called the activation
function, and the output y is called the activation. We can see that the polynomial
(1.1) can be viewed as a specific instance of this representation in which the inputs
x_i are given by powers of a single variable x, and the function f(·) is just the identity
f(a) = a. The simple mathematical formulation given by (1.5) and (1.6) has formed
the basis of neural network models from the 1960s up to the present day, and can be
represented in diagram form as shown in Figure 1.13.

Figure 1.13 A simple neural network diagram representing the transformations (1.5) and (1.6) describing a single neuron. The polynomial function (1.1) can be seen as a special case of this model.
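
The single-neuron model is small enough to write out directly. In the sketch below, the function name `neuron` and the example weights are hypothetical, and we treat the constant term w_0 of the polynomial as the weight on an extra input fixed at x^0 = 1, which is one way to reconcile the index ranges of (1.1) and (1.5).

```python
import numpy as np

def neuron(x, w, f=lambda a: a):
    """Return the activation y = f(a) for inputs x and weights w."""
    a = np.dot(w, x)   # pre-activation, equation (1.5)
    return f(a)        # activation, equation (1.6)

# The polynomial (1.1) as a special case: the inputs are powers of a
# single variable, and the activation function is the identity.
x_scalar = 0.5
w = np.array([1.0, -2.0, 3.0])            # hypothetical w_0, w_1, w_2
powers = x_scalar ** np.arange(len(w))    # x^0, x^1, x^2
y_poly = neuron(powers, w)

# The same weights with a nonlinear activation function
y_tanh = neuron(powers, w, f=np.tanh)
print(y_poly, y_tanh)
```

Swapping the activation function changes the output without touching the linear combination, which is the separation that (1.5) and (1.6) express.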


1.3.1 Single-layer networks 


The history of artificial neural networks can broadly be divided into three distinct 
phases according to the level of sophistication of the networks as measured by the 
number of ‘layers’ of processing. A simple neural model described by (1.5) and (1.6) 
can be viewed as having a single layer of processing corresponding to the single layer 
of connections in Figure 1.13. One of the most important such models in the history 
of neural computing is the perceptron (Rosenblatt, 1962) in which the activation 
function f(·) is a step function of the form

$$
f(a) = \begin{cases} 0, & \text{if } a \leqslant 0, \\ 1, & \text{if } a > 0. \end{cases} \tag{1.7}
$$


This can be viewed as a simplified model of neural firing in which a neuron fires if, 
and only if, the total weighted input exceeds a threshold of 0. The perceptron was 
pioneered by Rosenblatt (1962), who developed a specific training algorithm that 
has the interesting property that if there exists a set of weight values for which the 
perceptron can achieve perfect classification of its training data then the algorithm 
is guaranteed to find the solution in a finite number of steps (Bishop, 2006). As 
well as a learning algorithm, the perceptron also had a dedicated analogue hardware
implementation, as shown in Figure 1.14. A typical perceptron configuration had
multiple layers of processing, but only one of those layers was learnable from data,
and so the perceptron is considered to be a ‘single-layer’ neural network.

Figure 1.14 Illustration of the Mark 1 perceptron hardware. The photograph on the left shows how the inputs
were obtained using a simple camera system in which an input scene, in this case a printed character, was
illuminated by powerful lights, and an image focused onto a 20 × 20 array of cadmium sulphide photocells,
giving a primitive 400-pixel image. The perceptron also had a patch board, shown in the middle photograph,
which allowed different configurations of input features to be tried. Often these were wired up at random to
demonstrate the ability of the perceptron to learn without the need for precise wiring, in contrast to a modern
digital computer. The photograph on the right shows one of the racks of learnable weights. Each weight was
implemented using a rotary variable resistor, also called a potentiometer, driven by an electric motor thereby
allowing the value of the weight to be adjusted automatically by the learning algorithm.
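
The text does not spell out the perceptron training algorithm, so the following sketch uses the standard textbook form of the rule as an assumption: whenever a training example is misclassified, the weights are moved by the input vector, with sign given by the difference between target and output. The function names and the toy data set (the logical OR of two binary inputs, with a constant first input acting as a bias) are hypothetical.

```python
import numpy as np

def step(a):
    """Step activation (1.7): 1 if a > 0, otherwise 0."""
    return (np.asarray(a) > 0).astype(int)

def train_perceptron(X, t, max_epochs=100):
    """Assumed classic rule: w <- w + (t - y) x on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            y_n = int(step(np.dot(w, x_n)))
            if y_n != t_n:
                w += (t_n - y_n) * x_n
                mistakes += 1
        if mistakes == 0:      # every training point classified correctly
            break
    return w

# Toy, linearly separable data: logical OR, first column is a constant 1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 1, 1, 1])

w = train_perceptron(X, t)
print("weights:", w, "predictions:", step(X @ w))
```

The loop stops as soon as a full pass over the data produces no mistakes, which reflects the finite-step guarantee for data that a perceptron can classify perfectly. For data that is not linearly separable, the loop simply runs until `max_epochs` is reached.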

At first, the ability of perceptrons to learn from data in a brain-like way was con- 
sidered remarkable. However, it became apparent that the model also has major lim- 
itations. The properties of perceptrons were analysed by Minsky and Papert (1969), 
who gave formal proofs of the limited capabilities of single-layer networks. Unfortu- 
nately, they also speculated that similar limitations would extend to networks having 
multiple layers of learnable parameters. Although this latter conjecture proved to 
be wildly incorrect, the effect was to dampen enthusiasm for neural network mod- 
els, and this contributed to the lack of interest, and funding, for neural networks 
during the 1970s and early 1980s. Furthermore, researchers were unable to explore 
the properties of multilayered networks due to the lack of an effective algorithm 
for training them, since techniques such as the perceptron algorithm were specific 
to single-layer models. Note that although perceptrons have long disappeared from 
practical machine learning, the name lives on because a modern neural network is 
also sometimes called a multilayer perceptron or MLP. 


1.3.2 Backpropagation 


The solution to the problem of training neural networks having more than one 
layer of learnable parameters came from the use of differential calculus and the appli- 
cation of gradient-based optimization methods. An important change was to replace 
the step function (1.7) with continuous differentiable activation functions having a 
non-zero gradient. Another key modification was to introduce differentiable error 
functions that define how well a given choice of parameter values predicts the target 
variables in the training set. We saw an example of such an error function when we
used the sum-of-squares error function (1.2) to fit polynomials.

Figure 1.15 A neural network having two layers of parameters in which arrows denote the direction of information flow through the network. Each of the hidden units and each of the output units computes a function of the form given by (1.5) and (1.6) in which the activation function f(·) is differentiable. [Diagram labels: inputs, hidden units, outputs.]

With these changes, we now have an error function whose derivatives with re- 
spect to each of the parameters in the network can be evaluated. We can now consider 
networks having more than one layer of parameters. Figure 1.15 shows a simple
network with two processing layers. Nodes in the middle layer are called hidden units
because their values do not appear in the training set, which only provides values 
for inputs and outputs. Each of the hidden units and each of the output units in 
Figure 1.15 computes a function of the form given by (1.5) and (1.6). For a given 
set of input values, the states of all of the hidden and output units can be evaluated 
by repeated application of (1.5) and (1.6) in which information is flowing forward 
through the network in the direction of the arrows. For this reason, such models are 
sometimes also called feed-forward neural networks. 
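
Forward propagation through a network like the one in Figure 1.15 amounts to applying (1.5) and (1.6) twice, once per layer. The sketch below makes several assumptions that the text leaves open: a tanh activation for the hidden units, an identity activation for the outputs, no bias terms, and weight matrices named `W1` and `W2` with randomly chosen values.

```python
import numpy as np

def forward(x, W1, W2, f=np.tanh):
    """Forward propagation through one hidden layer and one output layer."""
    a_hidden = W1 @ x    # (1.5) evaluated for every hidden unit at once
    z = f(a_hidden)      # (1.6): activations of the hidden units
    a_out = W2 @ z       # (1.5) evaluated for every output unit
    return a_out         # identity output activation (an assumption)

# Hypothetical sizes: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
y = forward(np.array([0.5, -1.0]), W1, W2)
print(y)
```

Each row of a weight matrix holds the weights of one unit, so a single matrix product evaluates an entire layer.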

To train such a network the parameters are first initialized using a random num- 
ber generator and are then iteratively updated using gradient-based optimization 
techniques. This involves evaluating the derivatives of the error function, which 
can be done efficiently in a process known as error backpropagation. In backpropa- 
gation, information flows backwards through the network from the outputs towards 
the inputs (Rumelhart, Hinton, and Williams, 1986). There exist many different op- 
timization algorithms that make use of gradients of the function to be optimized, but 
the one that is most prevalent in machine learning is also the simplest and is known 
as stochastic gradient descent. 
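
To give a feel for how these pieces fit together, here is a minimal sketch of error backpropagation and stochastic gradient descent for a network with one hidden layer. It is written under our own assumptions: tanh hidden units, identity outputs, a sum-of-squares error on one example at a time, no bias terms other than a constant input, and hypothetical values for the learning rate, hidden-layer size, and number of epochs.

```python
import numpy as np

def loss_and_grads(x, t, W1, W2):
    """Error 0.5 * ||y - t||^2 for one example, with its gradients."""
    # Forward pass: information flows from inputs to outputs
    z = np.tanh(W1 @ x)
    y = W2 @ z
    # Backward pass: error signals flow from outputs towards inputs
    delta_out = y - t                                     # dE/da at outputs
    grad_W2 = np.outer(delta_out, z)
    delta_hidden = (W2.T @ delta_out) * (1.0 - z ** 2)    # tanh' = 1 - tanh^2
    grad_W1 = np.outer(delta_hidden, x)
    return 0.5 * np.sum(delta_out ** 2), grad_W1, grad_W2

def sgd(X, T, n_hidden=8, lr=0.05, epochs=200, seed=0):
    """Random initialization followed by one update per training example."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))
    W2 = rng.normal(scale=0.5, size=(T.shape[1], n_hidden))
    for _ in range(epochs):
        for n in rng.permutation(len(X)):       # visit examples in random order
            _, g1, g2 = loss_and_grads(X[n], T[n], W1, W2)
            W1 -= lr * g1
            W2 -= lr * g2
    return W1, W2

# Hypothetical regression task: inputs (1, x) with targets sin(2*pi*x)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
T = np.sin(2 * np.pi * x)[:, None]
W1, W2 = sgd(X, T)
total_error = sum(loss_and_grads(X[n], T[n], W1, W2)[0] for n in range(len(X)))
print("total training error:", total_error)
```

In practice the backward pass is rarely written by hand like this, but doing so once for a small network shows that each gradient is an outer product of a backpropagated error signal with the activations feeding the layer.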

The ability to train neural networks having multiple layers of weights was a 
breakthrough that led to a resurgence of interest in the field starting around the mid- 
1980s. This was also a period in which the field moved beyond a focus on neurobio- 
logical inspiration and developed a more rigorous and principled foundation (Bishop, 
1995b). In particular, it was recognized that probability theory, and ideas from the 
field of statistics, play a central role in neural networks and machine learning. One 
key insight is that learning from data involves background assumptions, sometimes 
called prior knowledge or inductive biases. These might be incorporated explicitly, 
for example by designing the structure of a neural network such that the classifica- 
tion of a skin lesion does not depend on the location of the lesion within the image, 
or they might take the form of implicit assumptions that arise from the mathematical
form of the model or the way it is trained.

The development of backpropagation and gradient-based optimization dramati- 
cally increased the capability of neural networks to solve practical problems. How- 
ever, it was also observed that in networks with many layers, it was only weights in 
the final two layers that would learn useful values. With a few exceptions, notably 
models used for image analysis known as convolutional neural networks (LeCun 
et al., 1998), there were very few successful applications of networks having more 
than two layers. Again, this constrained the complexity of the problems that could 
be addressed effectively with these kinds of network. To achieve reasonable perfor- 
mance on many applications, it was necessary to use hand-crafted pre-processing to 
transform the input variables into some new space where, it was hoped, the machine 
learning problem would be easier to solve. This pre-processing stage is sometimes 
also called feature extraction. Although this approach was sometimes effective, it 
would clearly be much better if features could be learned from the data rather than 
being hand-crafted. 

By the start of the new millennium, the available neural network methods were 
once again reaching the limits of their capability. Researchers began to explore a 
raft of alternatives to neural networks, such as kernel methods, support vector ma- 
chines, Gaussian processes, and many others. Neural networks fell into disfavour 
once again, although a core of enthusiastic researchers continued to pursue the goal 
of a truly effective approach to training networks with many layers. 


1.3.3 Deep networks 


The third, and current, phase in the development of neural networks began dur- 
ing the second decade of the 21st century. A series of developments allowed neural 
networks with many layers of weights to be trained effectively, thereby removing 
previous limitations on the capabilities of these techniques. Networks with many lay- 
ers of weights are called deep neural networks and the sub-field of machine learning 
that focuses on such networks is called deep learning (LeCun, Bengio, and Hinton, 
2015). 

One important theme in the origins of deep learning was a significant increase 
in the scale of neural networks, measured in terms of the number of parameters. Al- 
though networks with a few hundred or a few thousand parameters were common in 
the 1980s, this steadily rose to the millions, and then billions, whereas current
state-of-the-art models can have in the region of one trillion (10^12) parameters. Networks
with many parameters require commensurately large data sets so that the training
signals can produce good values for those parameters. The combination of massive
models and massive data sets in turn requires computation on a massive scale when 
training the model. Specialist processors called graphics processing units, or GPUs, 
which had been developed for very fast rendering of graphical data for applications 
such as video games, proved to be well suited to the training of neural networks be- 
cause the functions computed by the units in one layer of a network can be evaluated 
in parallel, and this maps well onto the massive parallelism of GPUs (Krizhevsky, 
Sutskever, and Hinton, 2012). Today, training for the largest models is performed on 
large arrays of thousands of GPUs linked by specialist high-speed interconnections.

Figure 1.16 Plot of the number of compute cycles, measured in petaflop/s-days, needed to train a state-of-the-art neural network as a function of date, showing two distinct phases of exponential growth. [From OpenAI with permission.]


Figure 1.16 illustrates how the number of compute cycles needed to train a state- 
of-the-art neural network has grown over the years, showing two distinct phases of 
growth. The vertical axis has a logarithmic scale and has units of petaflop/s-days,
where a petaflop represents 10^15 (a thousand trillion) floating point operations, and a
petaflop/s is one petaflop per second. One petaflop/s-day represents computation at
the rate of a petaflop/s for a period of 24 hours, which is roughly 10^20 floating point
operations, and therefore, the top line of the graph represents an impressive 10^24
floating point operations. A straight line on the graph represents exponential growth, 
and we see that from the era of the perceptron up to around 2012, the doubling time 
was around 2 years, which is consistent with the general growth of computing power 
as a consequence of Moore’s law. From 2012 onward, which marks the era of deep 
learning, we again see exponential growth but the doubling time is now 3.4 months 
corresponding to a factor of 10 increase in compute power every year! 
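
The round numbers quoted above can be checked with a few lines of arithmetic. The variable and function names are our own, and the figure of 10^4 petaflop/s-days for the top line of the graph is inferred from the ratio of the two quantities given in the text rather than read from the plot.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETAFLOP = 10 ** 15                      # floating point operations

# One petaflop/s sustained for 24 hours
flop_per_petaflops_day = PETAFLOP * SECONDS_PER_DAY   # 8.64e19, roughly 10^20

# Top line of the graph, assumed to sit at 10^4 petaflop/s-days
top_line_flop = 10 ** 4 * flop_per_petaflops_day      # 8.64e23, roughly 10^24

def yearly_growth_factor(doubling_time_months):
    """Growth over 12 months for a given doubling time."""
    return 2.0 ** (12.0 / doubling_time_months)

growth_deep_learning = yearly_growth_factor(3.4)   # about 11.5x per year
growth_moore = yearly_growth_factor(24.0)          # about 1.41x per year
print(flop_per_petaflops_day, growth_deep_learning, growth_moore)
```

A doubling time of 3.4 months therefore corresponds to slightly more than a tenfold increase per year, compared with roughly 1.4 times per year for a two-year doubling time.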

It is often found that improvements in performance due to innovations in the 
architecture or incorporation of more sophisticated forms of inductive bias are soon
superseded simply by scaling up the quantity of training data, along with commen-
surate scaling of the model size and associated compute power used for training 
(Sutton, 2019). Not only can large models have superior performance on a specific 
task but they may be capable of solving a broader range of different problems with 
the same trained neural network. Large language models are a notable example as a 
single network not only has an extraordinary breadth of capability but is even able to 
outperform specialist networks designed to solve specific problems. 

We have seen that depth plays an important role in allowing neural networks to 
achieve high performance. One way to view the role of the hidden layers in a deep 
neural network is that of representation learning (Bengio, Courville, and Vincent, 
2012) in which the network learns to transform input data into a new representation 
that is semantically meaningful thereby creating a much easier problem for the final 
layer or layers to solve. Such internal representations can be repurposed to allow for 
the solution of related problems through transfer learning, as we saw for skin lesion 
classification. It is interesting to note that neural networks used to process images 
may learn internal representations that are remarkably like those observed in the 
mammalian visual cortex. Large neural networks that can be adapted or fine-tuned 
to a range of downstream tasks are called foundation models, and can take advan- 
tage of large, heterogeneous data sets to create models having broad applicability 
(Bommasani et al., 2021). 

In addition to scaling, there were other developments that helped in the suc- 
cess of deep learning. For example, in simple neural networks, the training signals 
become weaker as they are backpropagated through successive layers of a deep net- 
work. One technique for addressing this is the introduction of residual connections 
(He et al., 2015a) that facilitate the training of networks having hundreds of layers. 
Another key development was the introduction of automatic differentiation methods 
in which the code that performs backpropagation to evaluate error function gradients 
is generated automatically from the code used to specify the forward propagation. 
This allows researchers to experiment rapidly with different architectures for a neural 
network and to combine different architectural elements in multiple ways very easily 
since only the relatively simple forward propagation functions need to be coded ex- 
plicitly. Also, much of the research in machine learning has been conducted through 
open source, allowing researchers to build on the work of others, thereby further 
accelerating the rate of progress in the field. 




Probabilities 


In almost every application of machine learning we have to deal with uncertainty. For 
example, a system that classifies images of skin lesions as benign or malignant can 
never in practice achieve perfect accuracy. We can distinguish between two kinds of 
uncertainty. The first is epistemic uncertainty (derived from the Greek word episteme 
meaning knowledge), sometimes called systematic uncertainty. It arises because we 
only get to see data sets of finite size. As we observe more data, for instance more 
examples of benign and malignant skin lesion images, we are better able to predict 
the class of a new example. However, even with an infinitely large data set, we would 
still not be able to achieve perfect accuracy due to the second kind of uncertainty 
known as aleatoric uncertainty, also called intrinsic or stochastic uncertainty, or 
sometimes simply called noise. Generally speaking, the noise arises because we are 
able to observe only partial information about the world, and therefore, one way to 
reduce this source of uncertainty is to gather different kinds of data. This is illustrated 
using an extension of the sine curve example to two dimensions in Figure 2.1. 

Figure 2.1 An extension of the simple sine curve regression problem to two dimensions. (a) A plot of the function $y(x_1, x_2) = \sin(2\pi x_1)\sin(2\pi x_2)$. Data is generated by selecting values for $x_1$ and $x_2$, computing the corresponding value of $y(x_1, x_2)$, and then adding Gaussian noise. (b) Plot of 100 data points in which $x_2$ is unobserved showing high levels of noise. (c) Plot of 100 data points in which $x_2$ is held fixed at a single value, simulating the effect of being able to measure $x_2$ as well as $x_1$, showing much lower levels of noise. 

As a practical example of this, a biopsy sample of the skin lesion is much more 
informative than the image alone and might greatly improve the accuracy with which 
we can determine if a new lesion is malignant. Given both the image and the biopsy 
data, the intrinsic uncertainty might be very small, and by collecting a large training 
data set, we may be able to reduce the systematic uncertainty to a low level and 
thereby make predictions of the class of the lesion with high accuracy. 
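
The effect illustrated in Figure 2.1 can be reproduced numerically. The sketch below is a rough illustration under stated assumptions, not the procedure used to generate the figure: the noise standard deviation, the uniform distribution for $x_2$ on (0, 1), and the particular values chosen for $x_1$ and $x_2$ are all our own choices.

```python
import math
import random
import statistics


def y(x1, x2):
    """The two-dimensional function from Figure 2.1."""
    return math.sin(2 * math.pi * x1) * math.sin(2 * math.pi * x2)


rng = random.Random(0)   # fixed seed so the run is repeatable
NOISE_STD = 0.1          # assumed noise level
X1 = 0.25                # assumed: inspect targets observed at this value of x1
N = 5000

# x2 unobserved: it varies freely (assumed uniform on (0, 1)),
# so its effect on the target is indistinguishable from noise.
t_hidden = [y(X1, rng.random()) + rng.gauss(0.0, NOISE_STD) for _ in range(N)]

# x2 measured and held fixed: only the Gaussian noise remains.
X2 = 0.25                # assumed value, chosen for illustration
t_fixed = [y(X1, X2) + rng.gauss(0.0, NOISE_STD) for _ in range(N)]

spread_hidden = statistics.pstdev(t_hidden)
spread_fixed = statistics.pstdev(t_fixed)
print(f"spread of targets with x2 hidden: {spread_hidden:.3f}")
print(f"spread of targets with x2 fixed:  {spread_fixed:.3f}")
```

The spread of the targets at a given $x_1$ is several times larger when $x_2$ is hidden, which is the sense in which gathering a different kind of data reduces the apparent noise.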

Both kinds of uncertainty can be handled using the framework of probability 
theory, which provides a consistent paradigm for the quantification and manipula- 
tion of uncertainty and therefore forms one of the central foundations for machine 
learning. We will see that probabilities are governed by two simple formulae known 
as the sum rule and the product rule. When coupled with decision theory, these rules 
allow us, at least in principle, to make optimal predictions given all the information 
available to us, even though that information may be incomplete or ambiguous. 

The concept of probability is often introduced in terms of frequencies of repeat- 
able events. Consider, for example, the bent coin shown in Figure 2.2, and suppose 
that the shape of the coin is such that if it is flipped a large number of times, it lands 
concave side up 60% of the time, and therefore lands convex side up 40% of the 
time. We say that the probability of landing concave side up is 60% or 0.6. Strictly, 
the probability is defined in the limit of an infinite number of ‘trials’ or coin flips 
in this case. Because the coin must land either concave side up or convex side up, 
these probabilities add to 100% or 1.0. This definition of probability in terms of the 
frequency of repeatable events is the basis for the frequentist view of statistics. 

Figure 2.2 Probability can be viewed either as a frequency associated with a repeatable event or as a quantification of uncertainty. A bent coin can be used to illustrate the difference, as discussed in the text. [Panel labels: 60%, 40%.] 

Now suppose that, although we know that the probability that the coin will land 
concave side up is 0.6, we are not allowed to look at the coin itself and we do not 
know which side is heads and which is tails. If asked to take a bet on whether the coin 
will land heads or tails when flipped, then symmetry suggests that our bet should be 
based on the assumption that the probability of seeing heads is 0.5, and a 
more careful analysis shows that, in the absence of any additional information, this 
is indeed the rational choice. Here we are using probabilities in a more general sense 
than simply the frequency of events. Whether the convex side of the coin is heads or 
tails is not itself a repeatable event; it is simply unknown. The use of probability as a 
quantification of uncertainty is the Bayesian perspective and is more general in that 
it includes frequentist probability as a special case. We can learn about which side 
of the coin is heads if we are given results from a sequence of coin flips by making 
use of Bayesian reasoning. The more results we observe, the lower our uncertainty 
as to which side of the coin is which. 
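
The Bayesian reasoning described here can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive treatment: it uses Bayes' theorem, which is derived in Section 2.1.3, and the function name, the encoding of outcomes as 'H' and 'T', and the simulated data are all our own assumptions.

```python
import random

P_CONCAVE_UP = 0.6   # from the text: the coin lands concave side up 60% of the time


def posterior_heads_is_concave(flips, prior=0.5):
    """Posterior probability that 'heads' is the concave side.

    flips is a sequence of 'H' / 'T' outcomes (the face that lands up).
    """
    like_concave = 1.0   # hypothesis A: heads is the concave side, p(H) = 0.6
    like_convex = 1.0    # hypothesis B: heads is the convex side,  p(H) = 0.4
    for f in flips:
        like_concave *= P_CONCAVE_UP if f == "H" else 1 - P_CONCAVE_UP
        like_convex *= 1 - P_CONCAVE_UP if f == "H" else P_CONCAVE_UP
    evidence = like_concave * prior + like_convex * (1 - prior)
    return like_concave * prior / evidence


rng = random.Random(1)
# Simulated data in which heads really is the concave side (an assumption of this demo).
flips = ["H" if rng.random() < P_CONCAVE_UP else "T" for _ in range(200)]

for n in (0, 10, 50, 200):
    print(n, round(posterior_heads_is_concave(flips[:n]), 3))
```

With no flips observed, the posterior equals the prior of 0.5, which matches the symmetry argument above. Each observed head moves the posterior towards the hypothesis that heads is the concave side, and each tail moves it away.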

Having introduced the concept of probability informally, we turn now to a more 
detailed exploration of probabilities and discuss how to use them quantitatively. Con- 
cepts developed in the remainder of this chapter will form a core foundation for many 
of the topics discussed throughout the book. 


2.1. The Rules of Probability 


In this section we will derive two simple rules that govern the behaviour of proba- 
bilities. However, in spite of their apparent simplicity, these rules will prove to be 
very powerful and widely applicable. We will motivate the rules of probability by 
first introducing a simple example. 


2.1.1 A medical screening example 


Consider the problem of screening a population in order to provide early detec- 
tion of cancer, and let us suppose that 1% of the population actually have cancer. 
Ideally our test for cancer would give a positive result for anyone who has cancer 
and a negative result for anyone who does not. However, tests are not perfect, so 
we will suppose that when the test is given to people who are free of cancer, 3% of 
them will test positive. These are known as false positives. Similarly, when the test 
is given to people who do have cancer, 10% of them will test negative. These are 
called false negatives. The various error rates are illustrated in Figure 2.3. 

Figure 2.3 Illustration of the accuracy of a cancer test. Out of every hundred people taking the test who do not have cancer, shown on the left, on average three will test positive. For those who have cancer, shown on the right, out of every hundred people taking the test, on average 90 will test positive. 

Given this information, we might ask the following questions: (1) ‘If we screen 
the population, what is the probability that someone will test positive?’, (2) ‘If someone 
receives a positive test result, what is the probability that they actually have cancer?’. 
We could answer such questions by working through the cancer screening case 
in detail. Instead, however, we will pause our discussion of this specific example and 
first derive the general rules of probability, known as the sum rule of probability and 
the product rule. We will then illustrate the use of these rules by answering our two 
questions. 


2.1.2 The sum and product rules 


To derive the rules of probability, consider the slightly more general example 
shown in Figure 2.4 involving two variables X and Y. In our cancer example, X 
could represent the presence or absence of cancer, and Y could be a variable de- 
noting the outcome of the test. Because the values of these variables can vary from 
one person to another in a way that is generally unknown, they are called random 
variables or stochastic variables. We will suppose that X can take any of the values 
$x_i$ where $i = 1, \ldots, L$ and that Y can take the values $y_j$ where $j = 1, \ldots, M$. Consider 
a total of N trials in which we sample both of the variables X and Y, and let 
the number of such trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Also, let the number 
of trials in which X takes the value $x_i$ (irrespective of the value that Y takes) be 
denoted by $c_i$, and similarly let the number of trials in which Y takes the value $y_j$ be 
denoted by $r_j$. 

Figure 2.4 We can derive the sum and product rules of probability by considering a random variable X, which takes the values $\{x_i\}$ where $i = 1, \ldots, L$, and a second random variable Y, which takes the values $\{y_j\}$ where $j = 1, \ldots, M$. In this illustration, we have L = 5 and M = 3. If we consider the total number N of instances of these variables, then we denote the number of instances where $X = x_i$ and $Y = y_j$ by $n_{ij}$, which is the number of instances in the corresponding cell of the array. The number of instances in column i, corresponding to $X = x_i$, is denoted by $c_i$, and the number of instances in row j, corresponding to $Y = y_j$, is denoted by $r_j$. 

The probability that X will take the value $x_i$ and Y will take the value $y_j$ is 
written $p(X = x_i, Y = y_j)$ and is called the joint probability of $X = x_i$ and 
$Y = y_j$. It is given by the number of points falling in the cell i, j as a fraction of the 
total number of points, and hence 

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}. \tag{2.1}$$

Here we are implicitly considering the limit $N \to \infty$. Similarly, the probability that 
X takes the value $x_i$ irrespective of the value of Y is written as $p(X = x_i)$ and is 
given by the fraction of the total number of points that fall in column i, so that 


$$p(X = x_i) = \frac{c_i}{N}. \tag{2.2}$$

Since $\sum_i c_i = N$, we see that 

$$\sum_{i=1}^{L} p(X = x_i) = 1 \tag{2.3}$$

and, hence, the probabilities sum to one as required. Because the number of instances 
in column i in Figure 2.4 is just the sum of the number of instances in each cell of 
that column, we have $c_i = \sum_j n_{ij}$ and therefore, from (2.1) and (2.2), we have 


$$p(X = x_i) = \sum_{j=1}^{M} p(X = x_i, Y = y_j) \tag{2.4}$$

which is the sum rule of probability. Note that $p(X = x_i)$ is sometimes called the 
marginal probability and is obtained by marginalizing, or summing out, the other 
variables (in this case Y). 

If we consider only those instances for which $X = x_i$, then the fraction of 
such instances for which $Y = y_j$ is written $p(Y = y_j | X = x_i)$ and is called the 
conditional probability of $Y = y_j$ given $X = x_i$. It is obtained by finding the 
fraction of those points in column i that fall in cell i, j and, hence, is given by 

$$p(Y = y_j | X = x_i) = \frac{n_{ij}}{c_i}. \tag{2.5}$$

Summing both sides over j and using $\sum_j n_{ij} = c_i$, we obtain 

$$\sum_{j=1}^{M} p(Y = y_j | X = x_i) = 1 \tag{2.6}$$




showing that the conditional probabilities are correctly normalized. From (2.1), 
(2.2), and (2.5), we can then derive the following relationship: 


$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j | X = x_i)\, p(X = x_i), \tag{2.7}$$


which is the product rule of probability. 

So far, we have been quite careful to make a distinction between a random variable, 
such as X, and the values that the random variable can take, for example $x_i$. 
Thus, the probability that X takes the value $x_i$ is denoted $p(X = x_i)$. Although 
this helps to avoid ambiguity, it leads to a rather cumbersome notation, and in many 
cases there will be no need for such pedantry. Instead, we may simply write p(X) to 
denote a distribution over the random variable X, or $p(x_i)$ to denote the distribution 
evaluated for the particular value $x_i$, provided that the interpretation is clear from the 
context. 

With this more compact notation, we can write the two fundamental rules of 
probability theory in the following form: 


$$\text{sum rule} \qquad p(X) = \sum_{Y} p(X, Y) \tag{2.8}$$

$$\text{product rule} \qquad p(X, Y) = p(Y | X)\, p(X). \tag{2.9}$$


Here p(X,Y ) is a joint probability and is verbalized as ‘the probability of X and 
Y’. Similarly, the quantity p(Y |X) is a conditional probability and is verbalized as 
‘the probability of Y given X’. Finally, the quantity p(X) is a marginal probability 
and is simply ‘the probability of X’. These two simple rules form the basis for all of 
the probabilistic machinery that we will use throughout this book. 
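
The counting argument behind the two rules can be checked on a small table. The following sketch uses hypothetical counts $n_{ij}$ (not taken from Figure 2.4) and exact rational arithmetic; the variable names are our own.

```python
from fractions import Fraction

# Hypothetical counts: n[i][j] is the number of trials with X = x_i and Y = y_j,
# so n[i] holds the counts for one column of an array like Figure 2.4.
n = [
    [3, 1, 2],
    [4, 6, 0],
    [1, 2, 5],
]
N = sum(sum(col) for col in n)
L, M = len(n), len(n[0])

joint = [[Fraction(n[i][j], N) for j in range(M)] for i in range(L)]   # (2.1)
c = [sum(n[i]) for i in range(L)]                                      # column totals c_i
marginal_x = [Fraction(c[i], N) for i in range(L)]                     # (2.2)
cond_y_given_x = [
    [Fraction(n[i][j], c[i]) for j in range(M)] for i in range(L)      # (2.5)
]

# Sum rule (2.4): p(X = x_i) equals the joint summed over j.
sum_rule_holds = all(marginal_x[i] == sum(joint[i]) for i in range(L))

# Product rule (2.7): joint = conditional * marginal for every cell.
product_rule_holds = all(
    joint[i][j] == cond_y_given_x[i][j] * marginal_x[i]
    for i in range(L)
    for j in range(M)
)
print(sum_rule_holds, product_rule_holds)
```

Using `Fraction` avoids floating point rounding, so the equalities are tested exactly rather than to within a tolerance.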


2.1.3 Bayes’ theorem 


From the product rule, together with the symmetry property p(X, Y) = p(Y, X), 
we immediately obtain the following relationship between conditional probabilities: 


$$p(Y | X) = \frac{p(X | Y)\, p(Y)}{p(X)} \tag{2.10}$$

which is called Bayes’ theorem and which plays an important role in machine learn- 
ing. Note how Bayes’ theorem relates the conditional distribution p(Y |X) on the 
left-hand side of the equation, to the ‘reversed’ conditional distribution p(X|Y ) on 
the right-hand side. Using the sum rule, the denominator in Bayes’ theorem can be 
expressed in terms of the quantities appearing in the numerator: 


$$p(X) = \sum_{Y} p(X | Y)\, p(Y). \tag{2.11}$$


Thus, we can view the denominator in Bayes’ theorem as being the normalization 
constant required to ensure that the sum over the conditional probability distribution 
on the left-hand side of (2.10) over all values of Y equals one. 

Figure 2.5 An illustration of a distribution over two variables, X, which takes nine possible values, and Y, which takes two possible values. The top left figure shows a sample of 60 points drawn from a joint probability distribution over these variables. The remaining figures show histogram estimates of the marginal distributions p(X) and p(Y), as well as the conditional distribution p(X|Y = 1) corresponding to the bottom row in the top left figure. [Panels: p(X, Y), p(Y), p(X), p(X|Y = 1).] 


In Figure 2.5, we show a simple example involving a joint distribution over two 
variables to illustrate the concept of marginal and conditional distributions. Here a 
finite sample of N = 60 data points has been drawn from the joint distribution and 
is shown in the top left. In the top right is a histogram of the fractions of data points 
having each of the two values of Y. From the definition of probability, these frac- 
tions would equal the corresponding probabilities p(Y ) in the limit when the sample 
size $N \to \infty$. We can view the histogram as a simple way to model a probability 
distribution given only a finite number of points drawn from that distribution (see Section 3.5.1). The 
remaining two plots in Figure 2.5 show the corresponding histogram estimates of 
p(X) and p(X|Y = 1). 


2.1.4 Medical screening revisited 


Let us now return to our cancer screening example and apply the sum and prod- 
uct rules of probability to answer our two questions. For clarity, when working 
through this example, we will once again be explicit about distinguishing between 
the random variables and their instantiations. We will denote the presence or absence 
of cancer by the variable C, which can take two values: C = 0 corresponds to ‘no 
cancer’ and C = 1 corresponds to ‘cancer’. We have assumed that one person in a 
hundred in the population has cancer, and so we have 


p(C=1) = 1/100 (2.12) 
p(C=0) = 99/100, (2.13) 


respectively. Note that these satisfy p(C = 0) + p(C = 1) =1. 

Now let us introduce a second random variable T representing the outcome of a 
screening test, where T = 1 denotes a positive result, indicative of cancer, and T = 0 
a negative result, indicative of the absence of cancer. As illustrated in Figure 2.3, we 
know that for those who have cancer the probability of a positive test result is 90%, 
while for those who do not have cancer the probability of a positive test result is 3%. 
We can therefore write out all four conditional probabilities: 


$$p(T = 1 | C = 1) = 90/100 \tag{2.14}$$
$$p(T = 0 | C = 1) = 10/100 \tag{2.15}$$
$$p(T = 1 | C = 0) = 3/100 \tag{2.16}$$
$$p(T = 0 | C = 0) = 97/100. \tag{2.17}$$

Again, note that these probabilities are normalized so that 

$$p(T = 1 | C = 1) + p(T = 0 | C = 1) = 1 \tag{2.18}$$

and similarly 

$$p(T = 1 | C = 0) + p(T = 0 | C = 0) = 1. \tag{2.19}$$


We can now use the sum and product rules of probability to answer our first 
question and evaluate the overall probability that someone who is tested at random 
will have a positive test result: 


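$$\begin{aligned} p(T = 1) &= p(T = 1 | C = 0)\, p(C = 0) + p(T = 1 | C = 1)\, p(C = 1) \\ &= \frac{3}{100} \times \frac{99}{100} + \frac{90}{100} \times \frac{1}{100} = \frac{387}{10{,}000} = 0.0387. \end{aligned} \tag{2.20}$$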


We see that if a person is tested at random there is a roughly 4% chance that the 
test will be positive even though there is only a 1% chance that they actually have cancer. 
From this it follows, using the sum rule, that p(T = 0) = 1 − 387/10,000 = 
9613/10,000 = 0.9613 and, hence, there is a roughly 96% chance that the test will be 
negative. 

Now consider our second question, which is the one that is of particular interest 
to a person being screened: if a test is positive, what is the probability that the person 
has cancer? This requires that we evaluate the probability of cancer conditional 
on the outcome of the test, whereas the probabilities in (2.14) to (2.17) give the 
probability distribution over the test outcome conditioned on whether the person has 
cancer. We can solve the problem of reversing the conditional probability by using 
Bayes’ theorem (2.10) to give 

$$p(C = 1 | T = 1) = \frac{p(T = 1 | C = 1)\, p(C = 1)}{p(T = 1)} \tag{2.21}$$

$$= \frac{90}{100} \times \frac{1}{100} \times \frac{10{,}000}{387} = \frac{90}{387} \simeq 0.23 \tag{2.22}$$

so that if a person is tested at random and the test is positive, there is a 23% probability 
that they actually have cancer. From the sum rule, it then follows that 
p(C = 0|T = 1) = 1 − 90/387 = 297/387 ≃ 0.77, which is a 77% chance that they do not 
have cancer. 


2.1.5 Prior and posterior probabilities 


We can use the cancer screening example to provide an important interpretation 
of Bayes’ theorem as follows. If we had been asked whether someone is likely to 
have cancer, before they have received a test, then the most complete information we 
have available is provided by the probability p(C). We call this the prior probability 
because it is the probability available before we observe the result of the test. Once 
we are told that this person has received a positive test, we can then use Bayes’ theo- 
rem to compute the probability p(C|T), which we will call the posterior probability 
because it is the probability obtained after we have observed the test result T. 

In this example, the prior probability of having cancer is 1%. However, once we 
have observed that the test result is positive, we find that the posterior probability of 
cancer is now 23%, which is a substantially higher probability of cancer, as we would 
intuitively expect. We note, however, that a person with a positive test still has only a 
23% chance of actually having cancer, even though the test appears, from Figure 2.3, 
to be reasonably ‘accurate’. This conclusion seems counter-intuitive to many people. 
The reason has to do with the low prior probability of having cancer. Although 
the test provides strong evidence of cancer, this has to be combined with the prior 
probability using Bayes’ theorem to arrive at the correct posterior probability. 
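
The screening calculation can be reproduced with exact fractions. The sketch below is illustrative: the function name `posterior_given_positive` and its parameter names are hypothetical, and the default error rates are those assumed in the example (90% of people with cancer test positive, 3% of people without cancer test positive).

```python
from fractions import Fraction


def posterior_given_positive(prior,
                             p_pos_given_cancer=Fraction(90, 100),
                             p_pos_given_healthy=Fraction(3, 100)):
    """Return (p(T=1), p(C=1|T=1)) for a given prior p(C=1)."""
    # Sum and product rules, as in (2.20).
    p_pos = p_pos_given_healthy * (1 - prior) + p_pos_given_cancer * prior
    # Bayes' theorem, as in (2.21).
    return p_pos, p_pos_given_cancer * prior / p_pos


p_t1, posterior = posterior_given_positive(Fraction(1, 100))
print(p_t1, float(p_t1))             # 387/10000
print(posterior, float(posterior))   # 10/43, which is 90/387 in lowest terms

# The same test applied with a much higher prior gives a much higher posterior.
_, posterior_high_prior = posterior_given_positive(Fraction(1, 2))
print(float(posterior_high_prior))
```

Changing only the prior in the last call shows how strongly the posterior depends on it, which is the point made in the text about the low prior probability of cancer.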


2.1.6 Independent variables 


Finally, if the joint distribution of two variables factorizes into the product of the 
marginals, so that p(X, Y) = p(X)p(Y), then X and Y are said to be independent. 
An example of independent events would be the successive flips of a coin. From 
the product rule, we see that p(Y|X) = p(Y), and so the conditional distribution 
of Y given X is indeed independent of the value of X. In our cancer screening 
example, if the probability of a positive test is independent of whether the person has 
cancer, then p(T|C) = p(T), which means that from Bayes’ theorem (2.10) we have 
p(C|T) = p(C), and therefore the probability of cancer is not changed by observing the 
test outcome. Of course, such a test would be useless because the outcome of the 
test tells us nothing about whether the person has cancer. 


2.2. Probability Densities 

Figure 2.6 The concept of probability for discrete variables can be extended to that of a probability density p(x) over a continuous variable x and is such that the probability of x lying in the interval $(x, x + \delta x)$ is given by $p(x)\,\delta x$ for $\delta x \to 0$. The probability density can be expressed as the derivative of a cumulative distribution function P(x). [Axis labels: $\delta x$, x.] 


As well as considering probabilities defined over discrete sets of values, we also 
wish to consider probabilities with respect to continuous variables. For instance, we 
might wish to predict what dose of drug to give to a patient. Since there will be 
uncertainty in this prediction, we want to quantify this uncertainty and again we can 
make use of probabilities. However, we cannot simply apply the concepts of proba- 
bility discussed so far directly, since the probability of observing a specific value for 
a continuous variable, to infinite precision, will effectively be zero. Instead, we need 
to introduce the concept of a probability density. Here we will limit ourselves to a 
relatively informal discussion. 

We define the probability density p(x) over a continuous variable x to be such 
that the probability of x falling in the interval $(x, x + \delta x)$ is given by $p(x)\,\delta x$ for 
$\delta x \to 0$. This is illustrated in Figure 2.6. The probability that x will lie in an interval 
(a, b) is then given by 

$$p(x \in (a, b)) = \int_a^b p(x)\, \mathrm{d}x. \tag{2.23}$$


Because probabilities are non-negative, and because the value of x must lie some- 
where on the real axis, the probability density p(x) must satisfy the two conditions 


$$p(x) \geq 0 \tag{2.24}$$

$$\int_{-\infty}^{\infty} p(x)\, \mathrm{d}x = 1. \tag{2.25}$$


The probability that x lies in the interval $(-\infty, z)$ is given by the cumulative 
distribution function defined by 

$$P(z) = \int_{-\infty}^{z} p(x)\, \mathrm{d}x, \tag{2.26}$$




which satisfies P'(x) = p(x), as shown in Figure 2.6. 

If we have several continuous variables $x_1, \ldots, x_D$, denoted collectively by the 
vector $\mathbf{x}$, then we can define a joint probability density $p(\mathbf{x}) = p(x_1, \ldots, x_D)$ such 
that the probability of $\mathbf{x}$ falling in an infinitesimal volume $\delta\mathbf{x}$ containing the point $\mathbf{x}$ 
is given by $p(\mathbf{x})\,\delta\mathbf{x}$. This multivariate probability density must satisfy 

$$p(\mathbf{x}) \geq 0 \tag{2.27}$$

$$\int p(\mathbf{x})\, \mathrm{d}\mathbf{x} = 1 \tag{2.28}$$


in which the integral is taken over the whole of x space. More generally, we can also 
consider joint probability distributions over a combination of discrete and continuous 
variables. 

The sum and product rules of probability, as well as Bayes’ theorem, apply equally 
to probability densities and to combinations of discrete and continuous variables. 
If x and y are two real variables, then the sum and product rules take the 
form 


$$\text{sum rule} \qquad p(x) = \int p(x, y)\, \mathrm{d}y \tag{2.29}$$

$$\text{product rule} \qquad p(x, y) = p(y | x)\, p(x). \tag{2.30}$$


Similarly, Bayes’ theorem can be written in the form 


$$p(y | x) = \frac{p(x | y)\, p(y)}{p(x)} \tag{2.31}$$


where the denominator is given by 


$$p(x) = \int p(x | y)\, p(y)\, \mathrm{d}y. \tag{2.32}$$


A formal justification of the sum and product rules for continuous variables requires 
a branch of mathematics called measure theory (Feller, 1966) and lies outside 
the scope of this book. The validity of these rules can be seen informally, however, by 
dividing each real variable into intervals of width $\Delta$ and considering the discrete 
probability distribution over these intervals. Taking the limit $\Delta \to 0$ then turns sums 
into integrals and gives the desired result. 


2.2.1 Example distributions 


There are many forms of probability density that are in widespread use and 
that are important both in their own right and as building blocks for more complex 
probabilistic models. The simplest form would be one in which p(x) is a constant, 
independent of x, but this cannot be normalized because the integral in (2.28) will 
be divergent. Distributions that cannot be normalized are called improper. We can, 
however, have the uniform distribution that is constant over a finite region, say (c, d), 
and zero elsewhere, in which case (2.28) implies 


$$p(x) = 1/(d - c), \qquad x \in (c, d). \tag{2.33}$$

Figure 2.7 Plots of a uniform distribution over the range (−1, 1), shown in red, the exponential distribution with $\lambda = 1$, shown in blue, and a Laplace distribution with $\mu = 1$ and $\gamma = 1$, shown in green. 


Another simple form of density is the exponential distribution given by 

$$p(x | \lambda) = \lambda \exp(-\lambda x), \qquad x \geq 0. \tag{2.34}$$

A variant of the exponential distribution, known as the Laplace distribution, allows 
the peak to be moved to a location $\mu$ and is given by 

$$p(x | \mu, \gamma) = \frac{1}{2\gamma} \exp\left(-\frac{|x - \mu|}{\gamma}\right). \tag{2.35}$$

The constant, exponential, and Laplace distributions are illustrated in Figure 2.7. 
Another important distribution is the Dirac delta function, which is written 


$$p(x | \mu) = \delta(x - \mu). \tag{2.36}$$

This is defined to be zero everywhere except at $x = \mu$ and to have the property of 
integrating to unity according to (2.28). Informally, we can think of this as an infinitely 
narrow and infinitely tall spike located at $x = \mu$ with the property of having unit area. 
Finally, if we have a finite set of observations of x given by $\mathcal{D} = \{x_1, \ldots, x_N\}$ then 
we can use the delta function to construct the empirical distribution given by 

$$p(x | \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n), \tag{2.37}$$


which consists of a Dirac delta function centred on each of the data points. The 
probability density defined by (2.37) integrates to one as required. 
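
The normalization condition can be checked numerically for the uniform, exponential, and Laplace densities with the parameter values of Figure 2.7. The sketch below is a rough check rather than a proof: the midpoint-rule integrator, the function names, and the finite range (−40, 40) standing in for the whole real axis are our own assumptions.

```python
import math


def uniform(x, c=-1.0, d=1.0):
    """Uniform density (2.33) on (c, d), zero elsewhere."""
    return 1.0 / (d - c) if c < x < d else 0.0


def exponential(x, lam=1.0):
    """Exponential density (2.34), zero for negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def laplace(x, mu=1.0, gamma=1.0):
    """Laplace density (2.35)."""
    return math.exp(-abs(x - mu) / gamma) / (2 * gamma)


def integrate(p, a, b, steps=80_000):
    """Midpoint-rule approximation to the integral of p over (a, b)."""
    h = (b - a) / steps
    return h * sum(p(a + (k + 0.5) * h) for k in range(steps))


# Assumed: the range (-40, 40) is wide enough that the neglected tails are negligible.
for name, p in [("uniform", uniform), ("exponential", exponential), ("laplace", laplace)]:
    print(name, round(integrate(p, -40.0, 40.0), 4))
```

Each integral comes out close to one. Replacing any of these functions with a constant over the whole range would give a value that grows with the width of the range, which is the sense in which a constant density is improper.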


2.2.2 Expectations and covariances 


One of the most important operations involving probabilities is that of finding 
weighted averages of functions. The weighted average of some function f(x) under 
a probability distribution p(x) is called the expectation of f(x) and will be denoted 
by $\mathbb{E}[f]$. For a discrete distribution, it is given by summing over all possible values 
of x in the form 

$$\mathbb{E}[f] = \sum_{x} p(x) f(x) \tag{2.38}$$

where the average is weighted by the relative probabilities of the different values of 
x. For continuous variables, expectations are expressed in terms of an integration 
with respect to the corresponding probability density: 


$$\mathbb{E}[f] = \int p(x) f(x)\, \mathrm{d}x. \tag{2.39}$$


In either case, if we are given a finite number N of points drawn from the probability 
distribution or probability density, then the expectation can be approximated as a 
finite sum over these points: 


$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n). \tag{2.40}$$


The approximation in (2.40) becomes exact in the limit $N \to \infty$. 
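
The finite-sum approximation (2.40) is easy to try out. In the sketch below, the choice of $f(x) = x^2$ and of the uniform density on (0, 1) is ours, made because the exact expectation, the integral of $x^2$ over (0, 1), is 1/3.

```python
import random


def f(x):
    return x * x


rng = random.Random(0)   # fixed seed so the run is repeatable


def estimate(n):
    """Approximate E[f] by averaging f over n samples, as in (2.40)."""
    return sum(f(rng.random()) for _ in range(n)) / n


exact = 1.0 / 3.0   # integral of x^2 over (0, 1)
for n in (10, 1000, 100_000):
    print(n, round(estimate(n), 4), "exact:", round(exact, 4))
```

The estimate fluctuates from run to run for small N and settles towards the exact value as N grows.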

Sometimes we will be considering expectations of functions of several variables, 
in which case we can use a subscript to indicate which variable is being averaged 
over, so that for instance 


$$\mathbb{E}_x[f(x, y)] \tag{2.41}$$


denotes the average of the function f(x, y) with respect to the distribution of x. Note 
that $\mathbb{E}_x[f(x, y)]$ will be a function of y. 

We can also consider a conditional expectation with respect to a conditional 
distribution, so that 


$$\mathbb{E}_x[f | y] = \sum_{x} p(x | y) f(x), \tag{2.42}$$

which is also a function of y. For continuous variables, the conditional expectation 
takes the form 

$$\mathbb{E}_x[f | y] = \int p(x | y) f(x)\, \mathrm{d}x. \tag{2.43}$$

The variance of f(x) is defined by 

$$\operatorname{var}[f] = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right] \tag{2.44}$$

and provides a measure of how much f(x) varies around its mean value $\mathbb{E}[f(x)]$. 
Expanding out the square, we see that the variance can also be written in terms of 
the expectations of $f(x)$ and $f(x)^2$: 

$$\operatorname{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2. \tag{2.45}$$

In particular, we can consider the variance of the variable x itself, which is given by 

$$\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2. \tag{2.46}$$


For two random variables x and y, the covariance measures the extent to which 
the two variables vary together and is defined by 
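$$\begin{aligned} \operatorname{cov}[x, y] &= \mathbb{E}_{x,y}\left[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\right] \\ &= \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\mathbb{E}[y]. \end{aligned} \tag{2.47}$$

If x and y are independent, then their covariance equals zero. 
For two vectors $\mathbf{x}$ and $\mathbf{y}$, their covariance is a matrix given by 

$$\begin{aligned} \operatorname{cov}[\mathbf{x}, \mathbf{y}] &= \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[\{\mathbf{x} - \mathbb{E}[\mathbf{x}]\}\{\mathbf{y}^{\mathrm{T}} - \mathbb{E}[\mathbf{y}^{\mathrm{T}}]\}\right] \\ &= \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^{\mathrm{T}}] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{y}^{\mathrm{T}}]. \end{aligned} \tag{2.48}$$

If we consider the covariance of the components of a vector $\mathbf{x}$ with each other, then 
we use a slightly simpler notation $\operatorname{cov}[\mathbf{x}] = \operatorname{cov}[\mathbf{x}, \mathbf{x}]$. 


2.3. The Gaussian Distribution 

Figure 2.8 Plot of a Gaussian distribution for a single continuous variable x showing the mean $\mu$ and the standard deviation $\sigma$. [Panel label: $\mathcal{N}(x | \mu, \sigma^2)$.] 

[Margin note: Section 3.2] 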


One of the most important probability distributions for continuous variables is called 
the normal or Gaussian distribution, and we will make extensive use of this distribu- 
tion throughout the rest of the book. For a single real-valued variable x, the Gaussian 
distribution is defined by 


$$\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}, \tag{2.49}$$


which represents a probability density over x governed by two parameters: $\mu$, called 
the mean, and $\sigma^2$, called the variance. The square root of the variance, given by 
$\sigma$, is called the standard deviation, and the reciprocal of the variance, written as 
$\beta = 1/\sigma^2$, is called the precision. We will see the motivation for this terminology 
shortly. Figure 2.8 shows a plot of the Gaussian distribution. Although the form 
of the Gaussian distribution might seem arbitrary, we will see later that it arises 
naturally from the concept of maximum entropy and from the perspective of the 
central limit theorem. 
From (2.49) we see that the Gaussian distribution satisfies 


$$\mathcal{N}(x \mid \mu, \sigma^2) > 0. \tag{2.50}$$
Also, it is straightforward to show that the Gaussian is normalized, so that
$$\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\,\mathrm{d}x = 1. \tag{2.51}$$
Thus, (2.49) satisfies the two requirements for a valid probability density. 


2.3.1 Mean and variance 


We can readily find expectations of functions of x under the Gaussian distribu- 
tion. In particular, the average value of x is given by 


$$\mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x \,\mathrm{d}x = \mu. \tag{2.52}$$


Because the parameter $\mu$ represents the average value of x under the distribution, it
is referred to as the mean. The integral in (2.52) is known as the first-order moment 
of the distribution because it is the expectation of x raised to the power one. We can 
similarly evaluate the second-order moment given by 


$$\mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2 \,\mathrm{d}x = \mu^2 + \sigma^2. \tag{2.53}$$


From (2.52) and (2.53), it follows that the variance of x is given by 


$$\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2 \tag{2.54}$$


and hence $\sigma^2$ is referred to as the variance parameter. The maximum of a distribution
is known as its mode. For a Gaussian, the mode coincides with the mean. 
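The results (2.51) to (2.54) can be checked numerically. The following is a minimal sketch, assuming illustrative parameter values and a simple midpoint-rule integrator (the helper names `gaussian_pdf` and `moment` are hypothetical, not taken from the text).

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density N(x | mu, sigma2) as defined in (2.49)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def moment(k, mu, sigma2, half_width=12.0, steps=20_000):
    """Midpoint-rule estimate of the integral of x**k * N(x | mu, sigma2).

    The integration range is mu +/- half_width standard deviations, outside
    of which the Gaussian density is negligible.
    """
    sigma = math.sqrt(sigma2)
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += (x ** k) * gaussian_pdf(x, mu, sigma2) * dx
    return total

mu, sigma2 = 1.5, 0.49              # illustrative values, not from the text
norm = moment(0, mu, sigma2)        # (2.51): should be close to 1
mean = moment(1, mu, sigma2)        # (2.52): should be close to mu
second = moment(2, mu, sigma2)      # (2.53): should be close to mu**2 + sigma2
variance = second - mean ** 2       # (2.54): should be close to sigma2
```

Because the integrand is smooth and decays rapidly, even this crude quadrature reproduces the analytical moments to many decimal places.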


2.3.2 Likelihood function 


Suppose that we have a data set of observations represented as a row vector
$\mathsf{x} = (x_1, \ldots, x_N)$, representing N observations of the scalar variable x. Note that
we are using the typeface $\mathsf{x}$ to distinguish this from a single observation of a
D-dimensional vector-valued variable, which we represent by a column vector
$\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$. We will suppose that the observations are drawn independently from
a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would
like to determine these parameters from the data set. The problem of estimating a
distribution, given a finite set of observations, is known as density estimation. It
should be emphasized that the problem of density estimation is fundamentally
ill-posed, because there are infinitely many probability distributions that could have
given rise to the observed finite data set. Indeed, any distribution p(x) that is
non-zero at each of the data points $x_1, \ldots, x_N$ is a potential candidate. Here we constrain
the space of distributions to be Gaussians, which leads to a well-defined solution.

Figure 2.9 Illustration of the likelihood function for the Gaussian distribution shown by the
red curve. Here the grey points denote a data set of values $\{x_n\}$, and the likelihood
function (2.55) is given by the product of the corresponding values of p(x) denoted by the
blue points. Maximizing the likelihood involves adjusting the mean and variance of the
Gaussian so as to maximize this product.

Data points that are drawn independently from the same distribution are said to
be independent and identically distributed, which is often abbreviated to i.i.d. or
IID. We have seen that the joint probability of two independent events is given by
the product of the marginal probabilities for each event separately. Because our data
set $\mathsf{x}$ is i.i.d., we can therefore write the probability of the data set, given $\mu$ and $\sigma^2$,
in the form

$$p(\mathsf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2). \tag{2.55}$$


When viewed as a function of $\mu$ and $\sigma^2$, this is called the likelihood function for the
Gaussian and is interpreted diagrammatically in Figure 2.9. 

One common approach for determining the parameters in a probability distribu- 
tion using an observed data set, known as maximum likelihood, is to find the param- 
eter values that maximize the likelihood function. This might appear to be a strange 
criterion because, from our foregoing discussion of probability theory, it would seem 
more natural to maximize the probability of the parameters given the data, not the 
probability of the data given the parameters. In fact, these two criteria are related. 

To start with, however, we will determine values for the unknown parameters $\mu$
and $\sigma^2$ in the Gaussian by maximizing the likelihood function (2.55). In practice,
it is more convenient to maximize the log of the likelihood function. Because the 
logarithm is a monotonically increasing function of its argument, maximizing the 
log of a function is equivalent to maximizing the function itself. Taking the log not 
only simplifies the subsequent mathematical analysis, but it also helps numerically 
because the product of a large number of small probabilities can easily underflow the 
numerical precision of the computer, and this is resolved by computing the sum of 
the log probabilities instead. From (2.49) and (2.55), the log likelihood function can 
be written in the form 


$$\ln p(\mathsf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi). \tag{2.56}$$


Maximizing (2.56) with respect to $\mu$, we obtain the maximum likelihood solution
given by

$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n, \tag{2.57}$$

which is the sample mean, i.e., the mean of the observed values $\{x_n\}$. Similarly,
maximizing (2.56) with respect to $\sigma^2$, we obtain the maximum likelihood solution
for the variance in the form

$$\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^2, \tag{2.58}$$

which is the sample variance measured with respect to the sample mean $\mu_{\mathrm{ML}}$. Note
that we are performing a joint maximization of (2.56) with respect to $\mu$ and $\sigma^2$, but
for a Gaussian distribution, the solution for $\mu$ decouples from that for $\sigma^2$ so that we
can first evaluate (2.57) and then subsequently use this result to evaluate (2.58). 
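The maximum likelihood recipe, together with the numerical point about underflow made earlier, can be sketched in a few lines. This is an illustration under stated assumptions: the data are synthetic draws from a Gaussian with arbitrarily chosen parameters, and the function names are hypothetical.

```python
import math
import random

def log_likelihood(xs, mu, sigma2):
    """Log likelihood (2.56) of i.i.d. data xs under N(x | mu, sigma2)."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -sq / (2.0 * sigma2) - 0.5 * n * math.log(sigma2) - 0.5 * n * math.log(2.0 * math.pi)

def fit_gaussian_ml(xs):
    """Maximum likelihood estimates: sample mean (2.57), then variance (2.58)."""
    n = len(xs)
    mu_ml = sum(xs) / n
    sigma2_ml = sum((x - mu_ml) ** 2 for x in xs) / n
    return mu_ml, sigma2_ml

rng = random.Random(0)                                  # fixed seed for repeatability
data = [rng.gauss(2.0, 1.5) for _ in range(2000)]       # synthetic data, true variance 2.25
mu_ml, sigma2_ml = fit_gaussian_ml(data)

# The raw likelihood (2.55) multiplies 2000 density values, each well below one,
# and underflows to 0.0 in double precision. The log likelihood stays finite.
raw_likelihood = 1.0
for x in data:
    raw_likelihood *= math.exp(-(x - mu_ml) ** 2 / (2.0 * sigma2_ml)) / math.sqrt(2.0 * math.pi * sigma2_ml)
best = log_likelihood(data, mu_ml, sigma2_ml)
```

Evaluating `log_likelihood` at any other parameter values, for example `mu_ml + 0.1`, gives a smaller value than `best`, which is what it means for (2.57) and (2.58) to be the maximizers.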


2.3.3 Bias of maximum likelihood 


The technique of maximum likelihood is widely used in deep learning and forms 
the foundation for most machine learning algorithms. However, it has some limita- 
tions, which we can illustrate using a univariate Gaussian. 

We first note that the maximum likelihood solutions $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ are functions
of the data set values $x_1, \ldots, x_N$. Suppose that each of these values has been generated
independently from a Gaussian distribution whose true parameters are $\mu$ and
$\sigma^2$. Now consider the expectations of $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ with respect to these data set
values. It is straightforward to show that

$$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu \tag{2.59}$$
$$\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \left(\frac{N-1}{N}\right)\sigma^2. \tag{2.60}$$


We see that, when averaged over data sets of a given size, the maximum likelihood 
solution for the mean will equal the true mean. However, the maximum likelihood 
estimate of the variance will underestimate the true variance by a factor (N — 1)/N. 
This is an example of a phenomenon called bias in which the estimator of a random 
quantity is systematically different from the true value. The intuition behind this 
result is given by Figure 2.10. 

Note that bias arises because the variance is measured relative to the maximum
likelihood estimate of the mean, which itself is tuned to the data. Suppose instead
we had access to the true mean $\mu$ and we used this to determine the variance using
the estimator

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2. \tag{2.61}$$

Then we find that

$$\mathbb{E}[\hat{\sigma}^2] = \sigma^2, \tag{2.62}$$
which is unbiased. Of course, we do not have access to the true mean but only
to the observed data values. From the result (2.60) it follows that for a Gaussian
distribution, the following estimate for the variance parameter is unbiased:

$$\tilde{\sigma}^2 = \frac{N}{N-1}\,\sigma^2_{\mathrm{ML}} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^2. \tag{2.63}$$


Figure 2.10 Illustration of how bias arises when using maximum likelihood to determine the mean and variance
of a Gaussian. The red curves show the true Gaussian distribution from which data is generated, and the three
blue curves show the Gaussian distributions obtained by fitting to three data sets, each consisting of two data
points shown in green, using the maximum likelihood results (2.57) and (2.58). Averaged across the three data
sets, the mean is correct, but the variance is systematically underestimated because it is measured relative to
the sample mean and not relative to the true mean.


Section 2.6.3 


Section 1.2 


Correcting for the bias of maximum likelihood in complex models such as neural 
networks is not so easy, however. 

Note that the bias of the maximum likelihood solution becomes less significant 
as the number N of data points increases. In the limit $N \to \infty$ the maximum
likelihood solution for the variance equals the true variance of the distribution that
generated the data. In the case of the Gaussian, for anything other than small N, this
bias will not prove to be a serious problem. However, throughout this book we will 
be interested in complex models with many parameters, for which the bias problems 
associated with maximum likelihood will be much more severe. In fact, the issue of 
bias in maximum likelihood is closely related to the problem of over-fitting. 
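The factor (N - 1)/N in (2.60) is easy to observe in simulation. The sketch below is only an illustration: the data set size, the number of repetitions, and the true parameters are arbitrary choices, and the averages are Monte Carlo estimates rather than exact expectations.

```python
import random

def ml_variance(xs):
    """Maximum likelihood variance (2.58), measured about the sample mean."""
    n = len(xs)
    mu_ml = sum(xs) / n
    return sum((x - mu_ml) ** 2 for x in xs) / n

rng = random.Random(1)
N = 5                     # points per data set (small, so the bias is visible)
trials = 20_000           # number of independent data sets to average over
true_mu, true_sigma = 0.0, 1.0

avg_ml = 0.0
for _ in range(trials):
    xs = [rng.gauss(true_mu, true_sigma) for _ in range(N)]
    avg_ml += ml_variance(xs) / trials

expected_ml = (N - 1) / N * true_sigma ** 2    # right-hand side of (2.60)
avg_corrected = avg_ml * N / (N - 1)           # rescaling that removes the bias
```

With these settings `avg_ml` comes out near 0.8 rather than the true variance of 1, while `avg_corrected` is close to 1.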


2.3.4 Linear regression 


We have seen how the problem of linear regression can be expressed in terms of 
error minimization. Here we return to this example and view it from a probabilistic 
perspective, thereby gaining some insights into error functions and regularization. 

The goal in the regression problem is to be able to make predictions for the
target variable t given some new value of the input variable x by using a set of
training data comprising N input values $\mathsf{x} = (x_1, \ldots, x_N)$ and their corresponding
target values $\mathsf{t} = (t_1, \ldots, t_N)$. We can express our uncertainty over the value of the
target variable using a probability distribution. For this purpose, we will assume that,
given the value of x, the corresponding value of t has a Gaussian distribution with
a mean equal to the value $y(x, \mathbf{w})$ of the polynomial curve given by (1.1), where $\mathbf{w}$
are the polynomial coefficients, and a variance $\sigma^2$. Thus, we have

$$p(t \mid x, \mathbf{w}, \sigma^2) = \mathcal{N}(t \mid y(x, \mathbf{w}), \sigma^2). \tag{2.64}$$


This is illustrated schematically in Figure 2.11.

Figure 2.11 Schematic illustration of a Gaussian conditional distribution for t given x,
defined by (2.64), in which the mean is given by the polynomial function $y(x, \mathbf{w})$, and
the variance is given by the parameter $\sigma^2$.

We now use the training data $\{\mathsf{x}, \mathsf{t}\}$ to determine the values of the unknown
parameters $\mathbf{w}$ and $\sigma^2$ by maximum likelihood. If the data is assumed to be drawn
independently from the distribution (2.64), then the likelihood function is given by

$$p(\mathsf{t} \mid \mathsf{x}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \sigma^2). \tag{2.65}$$


As we did for the simple Gaussian distribution earlier, it is convenient to maximize 
the logarithm of the likelihood function. Substituting for the Gaussian distribution, 
given by (2.49), we obtain the log likelihood function in the form 


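$$\ln p(\mathsf{t} \mid \mathsf{x}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi). \tag{2.66}$$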
Consider first the evaluation of the maximum likelihood solution for the polynomial
coefficients, which will be denoted by $\mathbf{w}_{\mathrm{ML}}$. These are determined by maximizing
(2.66) with respect to $\mathbf{w}$. For this purpose, we can omit the last two terms on the
right-hand side of (2.66) because they do not depend on $\mathbf{w}$. Also, note that scaling
the log likelihood by a positive constant coefficient does not alter the location of the
maximum with respect to $\mathbf{w}$, and so we can replace the coefficient $1/2\sigma^2$ with $1/2$.
Finally, instead of maximizing the log likelihood, we can equivalently minimize the
negative log likelihood. We therefore see that maximizing the likelihood is equivalent,
so far as determining $\mathbf{w}$ is concerned, to minimizing the sum-of-squares error
function defined by

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2. \tag{2.67}$$


Thus, the sum-of-squares error function has arisen as a consequence of maximizing
the likelihood under the assumption of a Gaussian noise distribution.

We can also use maximum likelihood to determine the variance parameter $\sigma^2$.
Maximizing (2.66) with respect to $\sigma^2$ gives

$$\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^2. \tag{2.68}$$

Note that we can first determine the parameter vector $\mathbf{w}_{\mathrm{ML}}$ governing the mean,
and subsequently use this to find the variance $\sigma^2_{\mathrm{ML}}$, as was the case for the simple
Gaussian distribution.

Having determined the parameters $\mathbf{w}$ and $\sigma^2$, we can now make predictions for
new values of x. Because we now have a probabilistic model, these are expressed
in terms of the predictive distribution that gives the probability distribution over t,
rather than simply a point estimate, and is obtained by substituting the maximum
likelihood parameters into (2.64) to give

$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \sigma^2_{\mathrm{ML}}) = \mathcal{N}(t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \sigma^2_{\mathrm{ML}}). \tag{2.69}$$
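The two-stage procedure (least squares for the coefficients, then the mean squared residual for the variance) can be sketched as follows. Everything specific here is an assumption made for illustration: the synthetic data, the quadratic polynomial order, the generating coefficients `true_w`, and the small hand-written linear solver that stands in for a linear algebra library.

```python
import math
import random

def predict(x, w):
    """Polynomial y(x, w) = sum_j w_j * x**j."""
    return sum(wj * x ** j for j, wj in enumerate(w))

def sum_of_squares(w, xs, ts):
    """Error function E(w) from (2.67)."""
    return 0.5 * sum((predict(x, w) - t) ** 2 for x, t in zip(xs, ts))

def solve_linear(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - s) / aug[i][i]
    return w

def fit_polynomial_ml(xs, ts, order):
    """Return (w_ml, sigma2_ml): normal equations for w, then (2.68) for the variance."""
    m = order + 1
    rows = [[x ** j for j in range(m)] for x in xs]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    rhs = [sum(r[i] * t for r, t in zip(rows, ts)) for i in range(m)]
    w_ml = solve_linear(gram, rhs)
    sigma2_ml = 2.0 * sum_of_squares(w_ml, xs, ts) / len(xs)
    return w_ml, sigma2_ml

def predictive_density(t, x, w_ml, sigma2_ml):
    """Predictive distribution (2.69) evaluated at target value t for input x."""
    mean = predict(x, w_ml)
    return math.exp(-(t - mean) ** 2 / (2.0 * sigma2_ml)) / math.sqrt(2.0 * math.pi * sigma2_ml)

rng = random.Random(2)
xs = [i / 49 for i in range(50)]
true_w = [0.5, -1.0, 2.0]                                   # hypothetical generating coefficients
ts = [predict(x, true_w) + rng.gauss(0.0, 0.1) for x in xs]  # noise standard deviation 0.1
w_ml, sigma2_ml = fit_polynomial_ml(xs, ts, order=2)
```

The fitted `sigma2_ml` should be of the order of the noise variance 0.01 used to generate the data, and `predictive_density(t, x, w_ml, sigma2_ml)` then gives a full distribution over t for any new x rather than a single point estimate.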


2.4. Transformation of Densities


We turn now to a discussion of how a probability density transforms under a nonlin- 
ear change of variable. This property will play a crucial role when we discuss a class 
of generative models called normalizing flows. It also highlights that a probability 
density has a different behaviour than a simple function under such transformations. 

Consider a single variable x and suppose we make a change of variables
$x = g(y)$; then a function $f(x)$ becomes a new function $\tilde{f}(y)$ defined by

$$\tilde{f}(y) = f(g(y)). \tag{2.70}$$


Now consider a probability density $p_x(x)$, and again change variables using
$x = g(y)$, giving rise to a density $p_y(y)$ with respect to the new variable y, where the
suffixes denote that $p_x(x)$ and $p_y(y)$ are different densities. Observations falling in
the range $(x, x + \delta x)$ will, for small values of $\delta x$, be transformed into the range
$(y, y + \delta y)$, where $x = g(y)$, and $p_x(x)\,\delta x \simeq p_y(y)\,\delta y$. Hence, if we take the limit
$\delta x \to 0$, we obtain

$$p_y(y) = p_x(x)\left|\frac{\mathrm{d}x}{\mathrm{d}y}\right| = p_x(g(y))\left|\frac{\mathrm{d}g}{\mathrm{d}y}\right|. \tag{2.71}$$


Here the modulus $|\cdot|$ arises because the derivative $\mathrm{d}y/\mathrm{d}x$ could be negative, whereas
the density is scaled by the ratio of lengths, which is always positive.


Exercise 2.19

This procedure for transforming densities can be very powerful. Any density
$p(y)$ can be obtained from a fixed density $q(x)$ that is everywhere non-zero by making
a nonlinear change of variable $y = f(x)$ in which $f(x)$ is a monotonic function
so that $0 < f'(x) < \infty$.
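One standard way to build such a monotonic map, used here as an assumption rather than as the construction the text has in mind, is to match cumulative distributions: send x to the value y at which the target cumulative distribution equals the cumulative distribution of q at x. The sketch takes q to be a standard Gaussian and the target p to be a unit-rate exponential; both choices are illustrative.

```python
import math
import random

def standard_normal_cdf(x):
    """Cumulative distribution of the fixed density q(x) = N(x | 0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exponential_inverse_cdf(u, rate=1.0):
    """Inverse cumulative distribution of the target p(y) = rate * exp(-rate * y)."""
    return -math.log1p(-u) / rate

def f(x):
    """Monotonic map y = f(x) turning samples of q(x) into samples of p(y)."""
    return exponential_inverse_cdf(standard_normal_cdf(x))

rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
ys = [f(x) for x in xs]                               # transformed samples
sample_mean = sum(ys) / len(ys)                       # exponential(1) has mean 1
sample_second = sum(y * y for y in ys) / len(ys)      # and second moment 2
```

The transformed samples have the moments of the exponential target even though every sample was drawn from the Gaussian. Note that `f` would fail for extremely large x, where the Gaussian cumulative distribution rounds to exactly 1 in floating point.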

One consequence of the transformation property (2.71) is that the concept of the
maximum of a probability density is dependent on the choice of variable. Suppose
$f(x)$ has a mode (i.e., a maximum) at $\hat{x}$ so that $f'(\hat{x}) = 0$. The corresponding mode
of $\tilde{f}(y)$ will occur for a value $\hat{y}$ obtained by differentiating both sides of (2.70) with
respect to y:

$$\tilde{f}'(\hat{y}) = f'(g(\hat{y}))\,g'(\hat{y}) = 0. \tag{2.72}$$
Assuming $g'(\hat{y}) \neq 0$ at the mode, then $f'(g(\hat{y})) = 0$. However, we know that
$f'(\hat{x}) = 0$, and so we see that the locations of the mode expressed in terms of each
of the variables x and y are related by $\hat{x} = g(\hat{y})$, as one would expect. Thus, finding
a mode with respect to the variable x is equivalent to first transforming to the variable
y, then finding a mode with respect to y, and then transforming back to x.

Now consider the behaviour of a probability density $p_x(x)$ under the change of
variables $x = g(y)$, where the density with respect to the new variable is $p_y(y)$ and
is given by (2.71). To deal with the modulus in (2.71) we can write $g'(y) = s|g'(y)|$
where $s \in \{-1, +1\}$. Then (2.71) can be written as

$$p_y(y) = p_x(g(y))\,s\,g'(y)$$

where we have used $1/s = s$. Differentiating both sides with respect to y then gives

$$p_y'(y) = s\,p_x'(g(y))\{g'(y)\}^2 + s\,p_x(g(y))\,g''(y). \tag{2.73}$$

Due to the presence of the second term on the right-hand side of (2.73), the relationship
$\hat{x} = g(\hat{y})$ no longer holds. Thus, the value of x obtained by maximizing
$p_x(x)$ will not be the value obtained by transforming to $p_y(y)$ then maximizing with
respect to y and then transforming back to x. This causes modes of densities to be
dependent on the choice of variables. However, for a linear transformation, the second
term on the right-hand side of (2.73) vanishes, and so in this case the location of
the maximum transforms according to $\hat{x} = g(\hat{y})$.

This effect can be illustrated with a simple example, as shown in Figure 2.12. We 
begin by considering a Gaussian distribution p(x) over x shown by the red curve 
in Figure 2.12. Next we draw a sample of N = 50,000 points from this distribution 
and plot a histogram of their values, which as expected agrees with the distribution 
p(x). Now consider a nonlinear change of variables from x to y given by 


$$x = g(y) = \ln(y) - \ln(1 - y) + 5. \tag{2.74}$$
The inverse of this function is given by

$$y = g^{-1}(x) = \frac{1}{1 + \exp(-x + 5)}, \tag{2.75}$$


which is a logistic sigmoid function and is shown in Figure 2.12 by the blue curve.

Figure 2.12 Example of the transformation of the mode of a density under a nonlinear change
of variables, illustrating the different behaviour compared to a simple function.

If we simply transform $p_x(x)$ as a function of x we obtain the green curve
$p_x(g(y))$ shown in Figure 2.12, and we see that the mode of the density $p_x(x)$ is
transformed via the sigmoid function to the mode of this curve. However, the density
over y transforms instead according to (2.71) and is shown by the magenta curve
on the left side of the diagram. Note that this has its mode shifted relative to the mode
of the green curve.

To confirm this result, we take our sample of 50,000 values of x, evaluate the 
corresponding values of y using (2.75), and then plot a histogram of their values. We 
see that this histogram matches the magenta curve in Figure 2.12 and not the green 
curve. 
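The shift of the mode can also be located without sampling, by evaluating both curves on a fine grid. In this sketch the mean 6 and unit variance of the Gaussian are assumptions (the text does not state the parameters used for the figure), and the names `mode_of_function` and `mode_of_density` are hypothetical.

```python
import math

MU, SIGMA = 6.0, 1.0      # assumed parameters of the Gaussian p_x(x)

def p_x(x):
    return math.exp(-(x - MU) ** 2 / (2.0 * SIGMA ** 2)) / math.sqrt(2.0 * math.pi * SIGMA ** 2)

def g(y):
    """Change of variables (2.74): x = g(y)."""
    return math.log(y) - math.log(1.0 - y) + 5.0

def g_inverse(x):
    """Logistic sigmoid (2.75): y = g^{-1}(x)."""
    return 1.0 / (1.0 + math.exp(-x + 5.0))

def g_prime(y):
    """Derivative dg/dy = 1/y + 1/(1 - y)."""
    return 1.0 / y + 1.0 / (1.0 - y)

def p_y(y):
    """Density over y obtained from (2.71)."""
    return p_x(g(y)) * abs(g_prime(y))

grid = [(i + 0.5) / 100_000 for i in range(100_000)]      # points strictly inside (0, 1)
mode_of_function = max(grid, key=lambda y: p_x(g(y)))     # p_x treated as a simple function
mode_of_density = max(grid, key=p_y)                      # p_x treated as a density
naive_mode = g_inverse(MU)                                # sigmoid applied to the mode of p_x
```

Under these assumptions `mode_of_function` coincides with `naive_mode` (about 0.73), whereas `mode_of_density` lies noticeably higher (about 0.84), because the Jacobian factor `g_prime` reweights the curve.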


2.4.1 Multivariate distributions 


We can extend the result (2.71) to densities defined over multiple variables. Consider
a density $p(\mathbf{x})$ over a D-dimensional variable $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$, and suppose
we transform to a new variable $\mathbf{y} = (y_1, \ldots, y_D)^{\mathrm{T}}$ where $\mathbf{x} = \mathbf{g}(\mathbf{y})$. Here we will
limit ourselves to the case where $\mathbf{x}$ and $\mathbf{y}$ have the same dimensionality. The transformed
density is then given by the generalization of (2.71) in the form

$$p_{\mathbf{y}}(\mathbf{y}) = p_{\mathbf{x}}(\mathbf{x})\,|\det \mathbf{J}| \tag{2.76}$$


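where $\mathbf{J}$ is the Jacobian matrix whose elements are given by the partial derivatives
$J_{ij} = \partial g_i / \partial y_j$, so that

$$\mathbf{J} = \begin{bmatrix} \dfrac{\partial g_1}{\partial y_1} & \cdots & \dfrac{\partial g_1}{\partial y_D} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial g_D}{\partial y_1} & \cdots & \dfrac{\partial g_D}{\partial y_D} \end{bmatrix}. \tag{2.77}$$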


Intuitively, we can view the change of variables as expanding some regions of space
and contracting others, with an infinitesimal region $\Delta\mathbf{x}$ around a point $\mathbf{x}$ being
transformed to a region $\Delta\mathbf{y}$ around the corresponding point $\mathbf{y}$. The absolute value of the
determinant of the Jacobian represents the ratio of these volumes and is the same factor
that arises when changing variables within an integral. The formula (2.76) follows
from the fact that the probability mass in region $\Delta\mathbf{x}$ is the same as the probability
mass in $\Delta\mathbf{y}$. Once again, we take the modulus to ensure that the density is non-negative.

Figure 2.13 Illustration of the effect of a change of variables on a probability distribution in two dimensions.
The left column shows the transforming of the variables whereas the middle and right columns show the
corresponding effects on a Gaussian distribution and on samples from that distribution, respectively.

Exercise 2.20

We can illustrate this by applying a change of variables to a Gaussian distribution
in two dimensions, as shown in the top row in Figure 2.13. Here the transformation
from $\mathbf{x}$ to $\mathbf{y}$ is given by

$$y_1 = x_1 + \tanh(5x_1) \tag{2.78}$$
$$y_2 = x_2 + \tanh(5x_2) + \frac{x_1^3}{3}. \tag{2.79}$$
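For this transformation the determinant can be written down by hand and cross-checked numerically. The sketch below makes two assumptions that should be read as such: the final term of (2.79) is taken to be x_1^3/3, and the density in x-space is taken to be a standard two-dimensional Gaussian. Because the map is given in the direction from x to y, the code works with the matrix of derivatives of y with respect to x, whose determinant is the reciprocal of det J in (2.76).

```python
import math

def transform(x1, x2):
    """Map (2.78)-(2.79) from x-space to y-space (last term assumed to be x1**3 / 3)."""
    y1 = x1 + math.tanh(5.0 * x1)
    y2 = x2 + math.tanh(5.0 * x2) + x1 ** 3 / 3.0
    return y1, y2

def det_dy_dx(x1, x2):
    """Determinant of the matrix of partial derivatives dy_i / dx_j.

    y1 does not depend on x2, so the matrix is lower triangular and the
    determinant is the product of its diagonal entries.
    """
    d11 = 1.0 + 5.0 / math.cosh(5.0 * x1) ** 2
    d22 = 1.0 + 5.0 / math.cosh(5.0 * x2) ** 2
    return d11 * d22

def p_x(x1, x2):
    """Assumed density in x-space: a standard two-dimensional Gaussian."""
    return math.exp(-0.5 * (x1 ** 2 + x2 ** 2)) / (2.0 * math.pi)

def p_y_at_image(x1, x2):
    """Density (2.76) at the point y = transform(x1, x2)."""
    return p_x(x1, x2) / abs(det_dy_dx(x1, x2))

def numerical_det(x1, x2, h=1e-6):
    """Central finite-difference estimate of det(dy/dx), used as a cross-check."""
    a1, a2 = transform(x1 + h, x2)
    b1, b2 = transform(x1 - h, x2)
    c1, c2 = transform(x1, x2 + h)
    d1, d2 = transform(x1, x2 - h)
    j11, j21 = (a1 - b1) / (2.0 * h), (a2 - b2) / (2.0 * h)
    j12, j22 = (c1 - d1) / (2.0 * h), (c2 - d2) / (2.0 * h)
    return j11 * j22 - j12 * j21
```

At the origin the map stretches each axis by a factor of 6, so the density there is reduced by a factor of 36 relative to x-space; far from the origin the tanh terms saturate and the density is almost unchanged.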


Also shown on the bottom row are samples from a Gaussian distribution in x-space
along with the corresponding transformed samples in y-space.


2.5. Information Theory


Probability theory forms the basis for another important framework called informa- 
tion theory, which quantifies the information present in a data set and which plays 
an important role in machine learning. Here we give a brief introduction to some of 
the key elements of information theory that we will need later in the book, including 
the important concept of entropy in its various forms. For a more comprehensive in- 
troduction to information theory, with connections to machine learning, see MacKay 
(2003). 


2.5.1 Entropy 


We begin by considering a discrete random variable x and we ask how much 
information is received when we observe a specific value for this variable. The 
amount of information can be viewed as the ‘degree of surprise’ on learning the 
value of x. If we are told that a highly improbable event has just occurred, we will 
have received more information than if we were told that some very likely event has 
just occurred, and if we knew that the event was certain to happen, we would receive 
no information. Our measure of information content will therefore depend on the 
probability distribution p(x), and so we look for a quantity h(x) that is a monotonic 
function of the probability p(x) and that expresses the information content. The form 
of $h(\cdot)$ can be found by noting that if we have two events x and y that are unrelated,
then the information gained from observing both of them should be the sum of the
information gained from each of them separately, so that $h(x, y) = h(x) + h(y)$.
Two unrelated events are statistically independent and so $p(x, y) = p(x)p(y)$. From
these two relationships, it is easily shown that h(x) must be given by the logarithm
of p(x) and so we have

$$h(x) = -\log_2 p(x) \tag{2.80}$$


where the negative sign ensures that information is positive or zero. Note that low 
probability events x correspond to high information content. The choice of base for 
the logarithm is arbitrary, and for the moment we will adopt the convention prevalent 
in information theory of using logarithms to the base of 2. In this case, as we will 
see shortly, the units of h(x) are bits (‘binary digits’).

Now suppose that a sender wishes to transmit the value of a random variable to 
a receiver. The average amount of information that they transmit in the process is 
obtained by taking the expectation of (2.80) with respect to the distribution p(x) and 
is given by 


$$\mathrm{H}[x] = -\sum_{x} p(x)\log_2 p(x). \tag{2.81}$$


This important quantity is called the entropy of the random variable x. Note that
$\lim_{\epsilon \to 0}(\epsilon \ln \epsilon) = 0$ and so we will take $p(x)\ln p(x) = 0$ whenever we encounter a
value for x such that p(x) = 0.

So far, we have given a rather heuristic motivation for the definition of information
(2.80) and the corresponding entropy (2.81). We now show that these definitions
indeed possess useful properties. Consider a random variable x having eight possible
states, each of which is equally likely. To communicate the value of x to a receiver,
we would need to transmit a message of length 3 bits. Notice that the entropy of this
variable is given by

$$\mathrm{H}[x] = -8 \times \frac{1}{8}\log_2\frac{1}{8} = 3 \text{ bits}.$$

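Now consider an example (Cover and Thomas, 1991) of a variable having eight
possible states {a, b, c, d, e, f, g, h} for which the respective probabilities are given
by $(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{16}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64})$. The entropy in this case is given by

$$\mathrm{H}[x] = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - \frac{4}{64}\log_2\frac{1}{64} = 2 \text{ bits}.$$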
We see that the nonuniform distribution has a smaller entropy than the uniform one, 
and we will gain some insight into this shortly when we discuss the interpretation of 
entropy in terms of disorder. For the moment, let us consider how we would transmit 
the identity of the variable’s state to a receiver. We could do this, as before, using 
a 3-bit number. However, we can take advantage of the nonuniform distribution by 
using shorter codes for the more probable events, at the expense of longer codes 
for the less probable events, in the hope of getting a shorter average code length. 
This can be done by representing the states {a, b, c, d, e, f, g, h} using, for instance, 
the following set of code strings: 0, 10, 110, 1110, 111100, 111101, 111110, and 
111111. The average length of the code that has to be transmitted is then 


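$$\text{average code length} = \frac{1}{2}\times 1 + \frac{1}{4}\times 2 + \frac{1}{8}\times 3 + \frac{1}{16}\times 4 + 4\times\frac{1}{64}\times 6 = 2 \text{ bits},$$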
which again is the same as the entropy of the random variable. Note that shorter code 
strings cannot be used because it must be possible to disambiguate a concatenation 
of such strings into its component parts. For instance, 11001110 decodes uniquely 
into the state sequence c, a, d. This relation between entropy and shortest coding 
length is a general one. The noiseless coding theorem (Shannon, 1948) states that 
the entropy is a lower bound on the number of bits needed to transmit the state of a 
random variable. 
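The numbers in this example are small enough to verify directly. The sketch below recomputes the two entropies and the average code length, and decodes a concatenated bit string using the code strings listed above; the function names are hypothetical.

```python
import math

states = "abcdefgh"
probs = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

def entropy_bits(p):
    """Entropy (2.81) in bits, with the convention that 0 * log 0 = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)

def average_code_length(p, code_strings):
    """Expected number of bits sent per state."""
    return sum(pi * len(c) for pi, c in zip(p, code_strings))

def decode(bits):
    """Split a concatenation of code strings back into states.

    This works by reading bits until they match a code string, which is
    unambiguous because no code string is a prefix of another.
    """
    table = dict(zip(codes, states))
    out, current = [], ""
    for b in bits:
        current += b
        if current in table:
            out.append(table[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code string")
    return out

uniform_entropy = entropy_bits([1 / 8] * 8)            # 3 bits
nonuniform_entropy = entropy_bits(probs)               # 2 bits
mean_length = average_code_length(probs, codes)        # 2 bits
decoded = decode("11001110")                           # ['c', 'a', 'd']
```

The average code length equals the entropy here because every probability is an exact power of one half, so each state can be given a code string of length exactly equal to its information content.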

From now on, we will switch to the use of natural logarithms in defining entropy, 
as this will provide a more convenient link with ideas elsewhere in this book. In this 
case, the entropy is measured in units of nats (from ‘natural logarithm’) instead of 
bits, which differ simply by a factor of ln 2. 


2.5.2 Physics perspective 


We have introduced the concept of entropy in terms of the average amount of 
information needed to specify the state of a random variable. In fact, the concept of 
entropy has much earlier origins in physics where it was introduced in the context 
of equilibrium thermodynamics and later given a deeper interpretation as a measure 
of disorder through developments in statistical mechanics. We can understand this 
alternative view of entropy by considering a set of N identical objects that are to be 
divided amongst a set of bins, such that there are $n_i$ objects in the ith bin. Consider
the number of different ways of allocating the objects to the bins. There are N
ways to choose the first object, (N - 1) ways to choose the second object, and
so on, leading to a total of N! ways to allocate all N objects to the bins, where N!
(pronounced ‘N factorial’) denotes the product $N \times (N-1) \times \cdots \times 2 \times 1$. However,
we do not wish to distinguish between rearrangements of objects within each bin. In
the ith bin there are $n_i!$ ways of reordering the objects, and so the total number of
ways of allocating the N objects to the bins is given by

$$W = \frac{N!}{\prod_i n_i!}, \tag{2.82}$$
which is called the multiplicity. The entropy is then defined as the logarithm of the
multiplicity scaled by a constant factor 1/N so that

$$\mathrm{H} = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i \ln n_i!. \tag{2.83}$$
We now consider the limit $N \to \infty$, in which the fractions $n_i/N$ are held fixed, and
apply Stirling’s approximation:
$$\ln N! \simeq N\ln N - N, \tag{2.84}$$

which gives

$$\mathrm{H} = -\lim_{N\to\infty}\sum_i \left(\frac{n_i}{N}\right)\ln\left(\frac{n_i}{N}\right) = -\sum_i p_i \ln p_i \tag{2.85}$$

where we have used $\sum_i n_i = N$. Here $p_i = \lim_{N\to\infty}(n_i/N)$ is the probability of
an object being assigned to the ith bin. In physics terminology, the specific allocation
of objects into bins is called a microstate, and the overall distribution of occupation
numbers, expressed through the ratios $n_i/N$, is called a macrostate. The multiplicity
W, which expresses the number of microstates in a given macrostate, is also known 
as the weight of the macrostate. 
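The convergence of (2.83) to (2.85) can be observed numerically. The following sketch assumes an arbitrary three-bin macrostate and uses `math.lgamma`, for which lgamma(n + 1) = ln n!, so that the factorials never have to be formed explicitly.

```python
import math

def log_multiplicity(counts):
    """ln W from (2.82), computed with lgamma(n + 1) = ln n!."""
    n_total = sum(counts)
    return math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in counts)

def entropy_nats(p):
    """Limiting entropy -sum_i p_i ln p_i from (2.85)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

fractions = [0.5, 0.25, 0.25]              # illustrative occupation fractions n_i / N
limit = entropy_nats(fractions)
per_object = {}
for n_total in (100, 10_000, 1_000_000):
    counts = [round(f * n_total) for f in fractions]
    per_object[n_total] = log_multiplicity(counts) / n_total    # (1/N) ln W, as in (2.83)
```

The values in `per_object` approach `limit` from below as N grows, with a gap that shrinks roughly like (ln N)/N, which is the size of the terms that Stirling's approximation (2.84) discards.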

We can interpret the bins as the states $x_i$ of a discrete random variable X, where
$p(X = x_i) = p_i$. The entropy of the random variable X is then

$$\mathrm{H}[p] = -\sum_i p(x_i)\ln p(x_i). \tag{2.86}$$


Distributions $p(x_i)$ that are sharply peaked around a few values will have a relatively
low entropy, whereas those that are spread more evenly across many values will have 
higher entropy, as illustrated in Figure 2.14. 

Figure 2.14 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropy
H for the broader distribution. The largest entropy would arise from a uniform distribution which would give
H = -ln(1/30) = 3.40.

Exercise 2.22
Exercise 2.23

Because $0 \le p_i \le 1$, the entropy is non-negative, and it will equal its minimum
value of 0 when one of the $p_i = 1$ and all other $p_{j \neq i} = 0$. The maximum entropy
configuration can be found by maximizing H using a Lagrange multiplier to enforce
the normalization constraint on the probabilities. Thus, we maximize

$$\tilde{\mathrm{H}} = -\sum_i p(x_i)\ln p(x_i) + \lambda\left(\sum_i p(x_i) - 1\right) \tag{2.87}$$
from which we find that all of the $p(x_i)$ are equal and are given by $p(x_i) = 1/M$
where M is the total number of states $x_i$. The corresponding value of the entropy
is then $\mathrm{H} = \ln M$. This result can also be derived from Jensen’s inequality (to be
discussed shortly). To verify that the stationary point is indeed a maximum, we can
evaluate the second derivative of the entropy, which gives

$$\frac{\partial^2 \tilde{\mathrm{H}}}{\partial p(x_i)\,\partial p(x_j)} = -I_{ij}\frac{1}{p_i} \tag{2.88}$$

where $I_{ij}$ are the elements of the identity matrix. We see that these values are all
negative and, hence, the stationary point is indeed a maximum.
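The step from (2.87) to the uniform solution can be spelled out explicitly. The following short derivation writes the Lagrangian of (2.87) as $\tilde{\mathrm{H}}$ and uses only quantities already defined; it is included as a worked check of the stated result.

```latex
\begin{align}
\frac{\partial \tilde{\mathrm{H}}}{\partial p(x_i)}
  &= -\ln p(x_i) - 1 + \lambda = 0
  &&\Longrightarrow\quad p(x_i) = \exp(\lambda - 1) \ \text{for every } i, \\
\sum_{i=1}^{M} p(x_i) &= 1
  &&\Longrightarrow\quad M \exp(\lambda - 1) = 1
  \quad\Longrightarrow\quad p(x_i) = \frac{1}{M}, \\
\mathrm{H} &= -\sum_{i=1}^{M} \frac{1}{M}\ln\frac{1}{M} = \ln M .
\end{align}
```

Since the stationarity condition does not depend on i, every state must receive the same probability, and the normalization constraint then fixes its value.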


2.5.3 Differential entropy 


We can extend the definition of entropy to include distributions p(x) over continuous
variables x as follows. First divide x into bins of width $\Delta$. Then, assuming
that p(x) is continuous, the mean value theorem (Weisstein, 1999) tells us that, for
each such bin, there must exist a value $x_i$ in the range $i\Delta \le x_i \le (i+1)\Delta$ such that

$$\int_{i\Delta}^{(i+1)\Delta} p(x)\,\mathrm{d}x = p(x_i)\Delta. \tag{2.89}$$
iA 


We can now quantize the continuous variable x by assigning any value x to the value
$x_i$ whenever x falls in the ith bin. The probability of observing the value $x_i$ is then
$p(x_i)\Delta$. This gives a discrete distribution for which the entropy takes the form

$$\mathrm{H}_\Delta = -\sum_i p(x_i)\Delta \ln\bigl(p(x_i)\Delta\bigr) = -\sum_i p(x_i)\Delta \ln p(x_i) - \ln\Delta \tag{2.90}$$

where we have used $\sum_i p(x_i)\Delta = 1$, which follows from (2.89) and (2.25). We now
omit the second term $-\ln\Delta$ on the right-hand side of (2.90), since it is independent
of p(x), and then consider the limit $\Delta \to 0$. The first term on the right-hand side of
(2.90) will approach the integral of $p(x)\ln p(x)$ in this limit so that

$$\lim_{\Delta \to 0}\left\{-\sum_i p(x_i)\Delta \ln p(x_i)\right\} = -\int p(x)\ln p(x)\,\mathrm{d}x \tag{2.91}$$


where the quantity on the right-hand side is called the differential entropy. We see
that the discrete and continuous forms of the entropy differ by a quantity $\ln\Delta$, which
diverges in the limit $\Delta \to 0$. This reflects that specifying a continuous variable
very precisely requires a large number of bits. For a density defined over multiple
continuous variables, denoted collectively by the vector $\mathbf{x}$, the differential entropy is
given by

$$\mathrm{H}[\mathbf{x}] = -\int p(\mathbf{x})\ln p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{2.92}$$
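The relationship between the discrete entropy of the quantized variable and the differential entropy can be checked for a concrete density. This sketch assumes a standard Gaussian, approximates the probability of each bin by the density at the bin midpoint times the bin width, and compares against the closed form for a Gaussian that the text gives later as (2.99).

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def discrete_entropy(delta, half_width=10.0):
    """Entropy of a standard Gaussian quantized into bins of width delta, as in (2.90).

    The bin midpoint stands in for the point x_i whose existence is
    guaranteed by the mean value theorem in (2.89).
    """
    n_bins = int(round(2.0 * half_width / delta))
    h = 0.0
    for i in range(n_bins):
        mid = -half_width + (i + 0.5) * delta
        mass = gaussian_pdf(mid) * delta
        if mass > 0.0:
            h -= mass * math.log(mass)
    return h

differential = 0.5 * (1.0 + math.log(2.0 * math.pi))     # closed form for unit variance
results = {d: discrete_entropy(d) for d in (0.5, 0.1, 0.01)}
gaps = {d: results[d] + math.log(d) - differential for d in results}
```

The discrete entropy in `results` keeps growing as the bins shrink, while adding ln(delta) brings it back to the same differential entropy each time, so the entries of `gaps` are all close to zero.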


2.5.4 Maximum entropy 


We saw for discrete distributions that the maximum entropy configuration cor- 
responds to a uniform distribution of probabilities across the possible states of the 
variable. Let us now consider the corresponding result for a continuous variable. If 
this maximum is to be well defined, it will be necessary to constrain the first and 
second moments of p(x) and to preserve the normalization constraint. We therefore 
maximize the differential entropy with the three constraints: 


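$$\int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1 \tag{2.93}$$
$$\int_{-\infty}^{\infty} x\,p(x)\,\mathrm{d}x = \mu \tag{2.94}$$
$$\int_{-\infty}^{\infty} (x - \mu)^2 p(x)\,\mathrm{d}x = \sigma^2. \tag{2.95}$$

Appendix C

The constrained maximization can be performed using Lagrange multipliers so that
we maximize the following functional with respect to p(x):

$$-\int_{-\infty}^{\infty} p(x)\ln p(x)\,\mathrm{d}x + \lambda_1\left(\int_{-\infty}^{\infty} p(x)\,\mathrm{d}x - 1\right) + \lambda_2\left(\int_{-\infty}^{\infty} x\,p(x)\,\mathrm{d}x - \mu\right) + \lambda_3\left(\int_{-\infty}^{\infty} (x - \mu)^2 p(x)\,\mathrm{d}x - \sigma^2\right). \tag{2.96}$$

Using the calculus of variations, we set the derivative of this functional to zero giving

$$p(x) = \exp\left\{-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2\right\}. \tag{2.97}$$

The Lagrange multipliers can be found by back-substitution of this result into the
three constraint equations, leading finally to the result:

$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}, \tag{2.98}$$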


and so the distribution that maximizes the differential entropy is the Gaussian. Note 
that we did not constrain the distribution to be non-negative when we maximized the 
entropy. However, because the resulting distribution is indeed non-negative, we see 
with hindsight that such a constraint is not necessary. 

If we evaluate the differential entropy of the Gaussian, we obtain 


Hla) = z {1+ In(2707)}. (2.99) 


Thus, we see again that the entropy increases as the distribution becomes broader,
i.e., as σ² increases. This result also shows that the differential entropy, unlike the
discrete entropy, can be negative, because H[x] < 0 in (2.99) for σ² < 1/(2πe).
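As an informal numerical check of these statements, the following sketch compares the closed form (2.99) with a direct midpoint-rule approximation of the entropy integral, and compares the Gaussian with two other densities of the same variance. The function names are illustrative, and the entropy formulas quoted for the uniform and Laplace densities are standard results that are assumed here rather than derived in the text.

```python
import math

def gaussian_entropy(var):
    """Closed form (2.99): H[x] = 0.5 * (1 + ln(2*pi*var)), in nats."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi * var))

def numerical_gaussian_entropy(var, num_cells=20000, half_width=12.0):
    """Midpoint-rule approximation of -int p(x) ln p(x) dx for a zero-mean Gaussian."""
    sigma = math.sqrt(var)
    dx = 2.0 * half_width * sigma / num_cells
    total = 0.0
    for i in range(num_cells):
        x = -half_width * sigma + (i + 0.5) * dx
        log_p = -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)
        total -= math.exp(log_p) * log_p * dx
    return total

def uniform_entropy(var):
    """Assumed standard result: a uniform density of width w has variance w**2 / 12 and entropy ln(w)."""
    return math.log(math.sqrt(12.0 * var))

def laplace_entropy(var):
    """Assumed standard result: a Laplace density of scale b has variance 2 * b**2 and entropy 1 + ln(2*b)."""
    return 1.0 + math.log(2.0 * math.sqrt(var / 2.0))

var = 2.0
print("Gaussian, closed form:", gaussian_entropy(var))
print("Gaussian, numerical:  ", numerical_gaussian_entropy(var))
print("uniform, same variance:", uniform_entropy(var))
print("Laplace, same variance:", laplace_entropy(var))

threshold = 1.0 / (2.0 * math.pi * math.e)  # variance at which (2.99) changes sign
print("entropy at threshold:", gaussian_entropy(threshold))
print("entropy for var = 0.01:", gaussian_entropy(0.01))
```

For a fixed variance the Gaussian value should be the largest of the three, and the last line illustrates that the differential entropy becomes negative once the variance falls below 1/(2πe).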


2.5.5 Kullback–Leibler divergence


So far in this section, we have introduced a number of concepts from informa- 
tion theory, including the key notion of entropy. We now start to relate these ideas 
to machine learning. Consider some unknown distribution p(x), and suppose that 
we have modelled this using an approximating distribution q(x). If we use q(x) to
construct a coding scheme for transmitting values of x to a receiver, then the average 
additional amount of information (in nats) required to specify the value of x (assum- 
ing we choose an efficient coding scheme) as a result of using q(x) instead of the 
true distribution p(x) is given by 


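\mathrm{KL}(p \| q) = -\int p(\mathbf{x}) \ln q(\mathbf{x}) \, d\mathbf{x} - \left( -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x} \right)

= -\int p(\mathbf{x}) \ln \left\{ \frac{q(\mathbf{x})}{p(\mathbf{x})} \right\} d\mathbf{x}. (2.100)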


This is known as the relative entropy or Kullback–Leibler divergence, or KL divergence
(Kullback and Leibler, 1951), between the distributions p(x) and q(x). Note
that it is not a symmetrical quantity, that is to say KL(p‖q) ≠ KL(q‖p).

We now show that the Kullback–Leibler divergence satisfies KL(p‖q) ≥ 0 with
equality if, and only if, p(x) = q(x). To do this we first introduce the concept of
convex functions. A function f(x) is said to be convex if it has the property that
every chord lies on or above the function, as shown in Figure 2.15.

Figure 2.15 A convex function f(x) is one for which every chord (shown in blue) lies on or above
the function (shown in red).

Any value of x in the interval from x = a to x = b can be written in the
form λa + (1 − λ)b where 0 ≤ λ ≤ 1. The corresponding point on the chord
is given by λf(a) + (1 − λ)f(b), and the corresponding value of the function is
f(λa + (1 − λ)b). Convexity then implies


f(\lambda a + (1 - \lambda) b) \leq \lambda f(a) + (1 - \lambda) f(b). (2.101)


This is equivalent to the requirement that the second derivative of the function be 
everywhere positive. Examples of convex functions are x ln x (for x > 0) and x². A
function is called strictly convex if the equality is satisfied only for λ = 0 and λ = 1.
If a function has the opposite property, namely that every chord lies on or below the
function, it is called concave, with a corresponding definition for strictly concave. If
a function f(x) is convex, then −f(x) will be concave.

Using the technique of proof by induction, we can show from (2.101) that a 
convex function f(x) satisfies 


f\left( \sum_{i=1}^{M} \lambda_i x_i \right) \leq \sum_{i=1}^{M} \lambda_i f(x_i) (2.102)


where λ_i ≥ 0 and Σ_i λ_i = 1, for any set of points {x_i}. The result (2.102) is known
as Jensen’s inequality. If we interpret the λ_i as the probability distribution over a
discrete variable x taking the values {x_i}, then (2.102) can be written


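f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)] (2.103)


where E[·] denotes the expectation. For continuous variables, Jensen’s inequality
takes the form


f\left( \int \mathbf{x} \, p(\mathbf{x}) \, d\mathbf{x} \right) \leq \int f(\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}. (2.104)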
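Jensen’s inequality in its discrete form (2.102) is easy to probe numerically. The sketch below evaluates the gap between the two sides for a few convex functions; the helper name jensen_gap and the particular points and weights are arbitrary choices made for illustration.

```python
import math

def jensen_gap(f, points, weights):
    """Return sum_i w_i f(x_i) - f(sum_i w_i x_i), which is non-negative for convex f."""
    mean_x = sum(w * x for w, x in zip(weights, points))
    mean_f = sum(w * f(x) for w, x in zip(weights, points))
    return mean_f - f(mean_x)

points = [0.5, 1.0, 2.0, 4.0]
weights = [0.1, 0.2, 0.3, 0.4]  # non-negative and summing to one

gap_square = jensen_gap(lambda x: x * x, points, weights)
gap_xlnx = jensen_gap(lambda x: x * math.log(x), points, weights)
gap_neglog = jensen_gap(lambda x: -math.log(x), points, weights)

print("x^2    :", gap_square)
print("x ln x :", gap_xlnx)
print("-ln x  :", gap_neglog)

# When all of the points coincide, the two sides of (2.102) are equal.
gap_degenerate = jensen_gap(lambda x: x * x, [3.0, 3.0], [0.5, 0.5])
print("degenerate case:", gap_degenerate)
```

Each of the first three gaps should be strictly positive because the functions are strictly convex and the points are distinct.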


We can apply Jensen’s inequality in the form (2.104) to the Kullback–Leibler
divergence (2.100) to give


\mathrm{KL}(p \| q) = -\int p(\mathbf{x}) \ln \left\{ \frac{q(\mathbf{x})}{p(\mathbf{x})} \right\} d\mathbf{x} \geq -\ln \int q(\mathbf{x}) \, d\mathbf{x} = 0 (2.105)

where we have used the fact that −ln x is a convex function, together with the normalization
condition ∫ q(x) dx = 1. In fact, −ln x is a strictly convex function, so the equality
will hold if, and only if, q(x) = p(x) for all x. Thus, we can interpret the Kullback–Leibler
divergence as a measure of the dissimilarity of the two distributions p(x) and
q(x).
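The properties derived above can be illustrated with a small discrete example, in which the integral in (2.100) is replaced by a sum over states. This discrete analogue and the two probability vectors are assumptions made purely for illustration.

```python
import math

def kl_divergence(p, q):
    """Discrete analogue of (2.100): KL(p||q) = -sum_i p_i ln(q_i / p_i), in nats.
    States with p_i = 0 contribute nothing to the sum."""
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print("KL(p||q) =", kl_pq)
print("KL(q||p) =", kl_qp)
print("KL(p||p) =", kl_pp)
```

The two orderings give different positive values, reflecting the asymmetry of the divergence, while the divergence of a distribution from itself vanishes.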

We see that there is an intimate relationship between data compression and den- 
sity estimation (i.e., the problem of modelling an unknown probability distribution) 
because the most efficient compression is achieved when we know the true distri- 
bution. If we use a distribution that is different from the true one, then we must 
necessarily have a less efficient coding, and on average the additional information 
that must be transmitted is (at least) equal to the Kullback–Leibler divergence
between the two distributions.

Suppose that data is being generated from an unknown distribution p(x) that we
wish to model. We can try to approximate this distribution using some parametric
distribution q(x|θ), governed by a set of adjustable parameters θ. One way to determine
θ is to minimize the Kullback–Leibler divergence between p(x) and q(x|θ)
with respect to θ. We cannot do this directly because we do not know p(x). Suppose,
however, that we have observed a finite set of training points x_n, for n = 1, ..., N,
drawn from p(x). Then the expectation with respect to p(x) can be approximated by
a finite sum over these points, using (2.40), so that


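\mathrm{KL}(p \| q) \simeq \frac{1}{N} \sum_{n=1}^{N} \left\{ -\ln q(\mathbf{x}_n | \boldsymbol{\theta}) + \ln p(\mathbf{x}_n) \right\}. (2.106)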


The second term on the right-hand side of (2.106) is independent of θ, and the first
term is the negative log likelihood function for θ under the distribution q(x|θ) evaluated
using the training set. Thus, we see that minimizing this Kullback–Leibler
divergence is equivalent to maximizing the log likelihood function.
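The equivalence can be seen in a small simulation. In the sketch below both p(x) and q(x|θ) are taken to be univariate Gaussians, an assumption made only so that the "unknown" p(x) can be evaluated and the estimate (2.106) computed explicitly; the sample size, seed, and perturbation grid are arbitrary.

```python
import math
import random

def gaussian_log_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def kl_estimate(samples, true_params, model_params):
    """Sample-based estimate (2.106) of KL(p||q) for Gaussian p and q, each given as (mean, variance)."""
    total = 0.0
    for x in samples:
        total += gaussian_log_pdf(x, *true_params) - gaussian_log_pdf(x, *model_params)
    return total / len(samples)

random.seed(0)
true_params = (1.0, 4.0)  # mean and variance playing the role of the unknown p(x)
samples = [random.gauss(true_params[0], math.sqrt(true_params[1])) for _ in range(5000)]

# Maximum likelihood solution for a Gaussian: sample mean and (biased) sample variance.
ml_mean = sum(samples) / len(samples)
ml_var = sum((x - ml_mean) ** 2 for x in samples) / len(samples)

kl_at_ml = kl_estimate(samples, true_params, (ml_mean, ml_var))
kl_perturbed = [
    kl_estimate(samples, true_params, (ml_mean + shift, ml_var * scale))
    for shift in (-0.5, 0.0, 0.5)
    for scale in (0.5, 1.0, 2.0)
]

print("ML parameters:", ml_mean, ml_var)
print("estimate (2.106) at the ML solution:", kl_at_ml)
print("smallest estimate over perturbed parameters:", min(kl_perturbed))
```

No perturbed parameter setting gives a smaller value than the maximum likelihood solution. Note that the finite-sample estimate at the maximum likelihood solution can fall slightly below zero, even though the exact divergence is non-negative, because the fitted model matches the particular sample at least as well as the true distribution does.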


2.5.6 Conditional entropy 


Now consider the joint distribution between two sets of variables x and y given 
by p(x, y) from which we draw pairs of values of x and y. If a value of x is already 
known, then the additional information needed to specify the corresponding value of 
y is given by −ln p(y|x). Thus the average additional information needed to specify
y can be written as 


H[\mathbf{y}|\mathbf{x}] = -\iint p(\mathbf{y}, \mathbf{x}) \ln p(\mathbf{y}|\mathbf{x}) \, d\mathbf{y} \, d\mathbf{x}, (2.107)


which is called the conditional entropy of y given x. It is easily seen, using the 
product rule, that the conditional entropy satisfies the relation: 


H[x, y] = H[y|x] + H[x] (2.108) 


where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential en- 
tropy of the marginal distribution p(x). Thus, the information needed to describe x 
and y is given by the sum of the information needed to describe x alone plus the 
additional information required to specify y given x. 
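The relation (2.108) also holds for discrete variables, with sums in place of integrals, and this makes it straightforward to check numerically. The joint table below is a made-up example for two binary variables.

```python
import math

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.20, 0.40]]

def entropy(probs):
    """Entropy in nats of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

p_x = [sum(row) for row in joint]                  # marginal p(x)
h_xy = entropy([p for row in joint for p in row])  # joint entropy H[x, y]
h_x = entropy(p_x)                                 # marginal entropy H[x]

# H[y|x] = -sum_{x,y} p(x, y) ln p(y|x), using p(y|x) = p(x, y) / p(x)
h_y_given_x = -sum(
    p * math.log(p / p_x[i])
    for i, row in enumerate(joint)
    for p in row
    if p > 0.0
)

print("H[x, y]        =", h_xy)
print("H[y|x] + H[x]  =", h_y_given_x + h_x)
```

The two printed values agree to within rounding error.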




Exercise 2.38 




2.5.7 Mutual information 


When two variables x and y are independent, their joint distribution will factor- 
ize into the product of their marginals p(x, y) = p(x)p(y). If the variables are not 
independent, we can gain some idea of whether they are ‘close’ to being independent 
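by considering the Kullback–Leibler divergence between the joint distribution and
the product of the marginals, given by


I[\mathbf{x}, \mathbf{y}] \equiv \mathrm{KL}(p(\mathbf{x}, \mathbf{y}) \| p(\mathbf{x}) p(\mathbf{y}))

= -\iint p(\mathbf{x}, \mathbf{y}) \ln \left( \frac{p(\mathbf{x}) \, p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})} \right) d\mathbf{x} \, d\mathbf{y}, (2.109)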


which is called the mutual information between the variables x and y. From the
properties of the Kullback–Leibler divergence, we see that I[x, y] ≥ 0 with equality
if, and only if, x and y are independent. Using the sum and product rules of
probability, we see that the mutual information is related to the conditional entropy
through


I[\mathbf{x}, \mathbf{y}] = H[\mathbf{x}] - H[\mathbf{x}|\mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}]. (2.110)

Thus, the mutual information represents the reduction in the uncertainty about x by
virtue of being told the value of y (or vice versa). From a Bayesian perspective, we
can view p(x) as the prior distribution for x and p(x|y) as the posterior distribution
after we have observed new data y. The mutual information therefore represents the
reduction in uncertainty about x as a consequence of the new observation y.
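A discrete example makes the interpretation as a reduction in uncertainty concrete. The sketch below uses sums in place of the integrals in (2.109), and the two joint tables (one dependent, one constructed as a product of marginals) are hypothetical.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def marginals(joint):
    """Return (p(x), p(y)) for a joint table whose rows index x and columns index y."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y

def mutual_information(joint):
    """Discrete form of (2.109): I[x, y] = -sum p(x,y) ln{ p(x) p(y) / p(x,y) }."""
    p_x, p_y = marginals(joint)
    return -sum(
        p * math.log(p_x[i] * p_y[j] / p)
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0.0
    )

def conditional_entropy_x_given_y(joint):
    """H[x|y] = -sum p(x,y) ln p(x|y), using p(x|y) = p(x,y) / p(y)."""
    _, p_y = marginals(joint)
    return -sum(
        p * math.log(p / p_y[j])
        for row in joint
        for j, p in enumerate(row)
        if p > 0.0
    )

dependent = [[0.30, 0.10], [0.20, 0.40]]
independent = [[0.5 * 0.4, 0.5 * 0.6], [0.5 * 0.4, 0.5 * 0.6]]

p_x, _ = marginals(dependent)
mi = mutual_information(dependent)
reduction = entropy(p_x) - conditional_entropy_x_given_y(dependent)
mi_independent = mutual_information(independent)

print("I[x, y]           =", mi)
print("H[x] - H[x|y]     =", reduction)
print("I for independent =", mi_independent)
```

The first two values coincide, as required by (2.110), and the mutual information of the factorized table vanishes to within rounding error.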


2.6 Bayesian Probabilities


When we considered the bent coin in Figure 2.2, we introduced the concept of prob- 
ability in terms of the frequencies of random, repeatable events, such as the prob- 
ability of the coin landing concave side up. We will refer to this as the classical 
or frequentist interpretation of probability. We also introduced the more general 
Bayesian view, in which probabilities provide a quantification of uncertainty. In this 
case, our uncertainty is whether the concave side of the coin is heads or tails. 

The use of probability to represent uncertainty is not an ad hoc choice but is 
inevitable if we are to respect common sense while making rational and coherent 
inferences. For example, Cox (1946) showed that if numerical values are used to 
represent degrees of belief, then a simple set of axioms encoding common sense 
properties of such beliefs leads uniquely to a set of rules for manipulating degrees of 
belief that are equivalent to the sum and product rules of probability. It is therefore 
natural to refer to these quantities as (Bayesian) probabilities. 

For the bent coin we assumed, in the absence of further information, that the 
probability of the concave side of the coin being heads is 0.5. Now suppose we 
are told the results of flipping the coin a few times. Intuitively, it seems that this 
should provide us with some information as to whether the concave side is heads. 
For instance, suppose we see many more flips that land tails than land heads. Given
that the coin is more likely to land concave side up, this provides evidence to suggest
that the concave side is more likely to be tails. In fact, this intuition is correct, and 
furthermore, we can quantify this using the rules of probability. Bayes’ theorem now 
acquires a new significance, because it allows us to convert the prior probability for 
the concave side being heads into a posterior probability by incorporating the data 
provided by the coin flips. Moreover, this process is iterative, meaning the posterior 
probability becomes the prior for incorporating data from further coin flips. 

One aspect of the Bayesian viewpoint is that the inclusion of prior knowledge 
arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times 
and lands heads each time. The maximum likelihood estimate of the probability 
of landing heads would give 1, implying that all future tosses will land heads! By 
contrast, a Bayesian approach with any reasonable prior will lead to a less extreme 
conclusion. 
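The contrast between the two estimates can be made concrete with a short calculation. The sketch below assumes a Beta(a, b) prior over the probability of heads, which is one convenient choice of "reasonable prior" and is not prescribed by the text; for this prior the posterior mean is (a + heads)/(a + b + tosses).

```python
def ml_estimate(heads, tosses):
    """Maximum likelihood estimate of the probability of heads."""
    return heads / tosses

def posterior_mean(heads, tosses, a=2.0, b=2.0):
    """Posterior mean of the probability of heads under an assumed Beta(a, b) prior."""
    return (a + heads) / (a + b + tosses)

heads, tosses = 3, 3
ml = ml_estimate(heads, tosses)
bayes = posterior_mean(heads, tosses)

print("maximum likelihood estimate:", ml)
print("posterior mean, Beta(2, 2) prior:", bayes)
```

With three heads in three tosses the maximum likelihood estimate is 1, whereas the posterior mean under the assumed prior is 5/7, which still allows for the possibility of tails.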


2.6.1 Model parameters 


The Bayesian perspective provides valuable insights into several aspects of ma- 
chine learning, and we can illustrate these using the sine curve regression example. 
Here we denote the training data set by D. We have already seen in the context of 
linear regression that the parameters can be chosen using maximum likelihood, in 
which w is set to the value that maximizes the likelihood function p(D|w). This 
corresponds to choosing the value of w for which the probability of the observed 
data set is maximized. In the machine learning literature, the negative log of the 
likelihood function is called an error function. Because the negative logarithm is a 
monotonically decreasing function, maximizing the likelihood is equivalent to minimizing
the error. This leads to a specific choice of parameter values, denoted w_ML,
which are then used to make predictions for new data. 

We have seen that different choices of training data set, for example containing 
different numbers of data points, give rise to different solutions for w_ML. From a
Bayesian perspective, we can also use the machinery of probability theory to describe 
this uncertainty in the model parameters. We can capture our assumptions about w, 
before observing the data, in the form of a prior probability distribution p(w). The 
effect of the observed data D is expressed through the likelihood function p(D|w), 
and Bayes’ theorem now takes the form 


p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w}) \, p(\mathbf{w})}{p(\mathcal{D})} (2.111)

which allows us to evaluate the uncertainty in w after we have observed D in the
form of the posterior probability p(w|D).

It is important to emphasize that the quantity p(D|w) is called the likelihood
function when it is viewed as a function of the parameter vector w, and it expresses
how probable the observed data set is for different values of w. Note that the likelihood
p(D|w) is not a probability distribution over w, and its integral with respect to
w does not (necessarily) equal one.

Given this definition of likelihood, we can state Bayes’ theorem in words:


\text{posterior} \propto \text{likelihood} \times \text{prior} (2.112)

where all of these quantities are viewed as functions of w. The denominator in
(2.111) is the normalization constant, which ensures that the posterior distribution 
on the left-hand side is a valid probability density and integrates to one. Indeed, by 
integrating both sides of (2.111) with respect to w, we can express the denominator 
in Bayes’ theorem in terms of the prior distribution and the likelihood function: 


p(\mathcal{D}) = \int p(\mathcal{D}|\mathbf{w}) \, p(\mathbf{w}) \, d\mathbf{w}. (2.113)
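For a model with a single parameter, (2.111) and (2.113) can be evaluated directly on a grid. The sketch below is an illustration under stated assumptions: w is taken to be the probability that a coin lands heads, the prior is uniform on [0, 1], the data consist of three heads and no tails, and the integrals are approximated with the midpoint rule.

```python
def likelihood(w, heads, tails):
    """p(D|w) for independent coin flips with heads probability w."""
    return w ** heads * (1.0 - w) ** tails

def prior(w):
    """Assumed uniform prior density on [0, 1]."""
    return 1.0

num_cells = 10000
dw = 1.0 / num_cells
grid = [(i + 0.5) * dw for i in range(num_cells)]  # cell midpoints

heads, tails = 3, 0
unnormalized = [likelihood(w, heads, tails) * prior(w) for w in grid]
evidence = sum(unnormalized) * dw                  # p(D), as in (2.113)
posterior = [u / evidence for u in unnormalized]   # p(w|D), as in (2.111)

likelihood_integral = sum(likelihood(w, heads, tails) for w in grid) * dw
posterior_integral = sum(posterior) * dw

print("p(D)                        =", evidence)
print("integral of likelihood dw   =", likelihood_integral)
print("integral of posterior dw    =", posterior_integral)
```

Here the likelihood w³ integrates to approximately 1/4 rather than to one, which illustrates that it is not a probability distribution over w, whereas the posterior is correctly normalized once it has been divided by p(D).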


In both the Bayesian and frequentist paradigms, the likelihood function p(D|w) 
plays a central role. However, the manner in which it is used is fundamentally dif- 
ferent in the two approaches. In a frequentist setting, w is considered to be a fixed 
parameter, whose value is determined by some form of ‘estimator’, and error bars on 
this estimate are determined (conceptually, at least) by considering the distribution 
of possible data sets D. By contrast, from the Bayesian viewpoint there is only a 
single data set D (namely the one that is actually observed), and the uncertainty in 
the parameters is expressed through a probability distribution over w. 


2.6.2 Regularization 


We can use this Bayesian perspective to gain insight into the technique of regu- 
larization that was used in the sine curve regression example to reduce over-fitting. 
Instead of choosing the model parameters by maximizing the likelihood function 
with respect to w, we can maximize the posterior probability (2.111). This technique 
is called the maximum a posteriori estimate, or simply MAP estimate. Equivalently, 
we can minimize the negative log of the posterior probability. Taking negative logs 
of both sides of (2.111), we have 


-\ln p(\mathbf{w}|\mathcal{D}) = -\ln p(\mathcal{D}|\mathbf{w}) - \ln p(\mathbf{w}) + \ln p(\mathcal{D}). (2.114)


The first term on the right-hand side of (2.114) is the usual negative log likelihood. The third
term can be omitted since it does not depend on w. The second term takes the form
of a function of w, which is added to the negative log likelihood, and we can recognize this
as a form of regularization. To make this more explicit, suppose we choose the prior
distribution p(w) to be the product of independent zero-mean Gaussian distributions
for each of the elements of w such that each has the same variance s² so that


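p(\mathbf{w}|s) = \prod_{i=0}^{M} \mathcal{N}(w_i|0, s^2) = \prod_{i=0}^{M} \left( \frac{1}{2\pi s^2} \right)^{1/2} \exp\left\{ -\frac{w_i^2}{2s^2} \right\}. (2.115)

Substituting into (2.114), we obtain

-\ln p(\mathbf{w}|\mathcal{D}) = -\ln p(\mathcal{D}|\mathbf{w}) + \frac{1}{2s^2} \sum_{i=0}^{M} w_i^2 + \text{const}. (2.116)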

If we consider the particular case of the linear regression model whose log likeli- 
hood is given by (2.66), then we find that maximizing the posterior distribution is 
equivalent to minimizing the function

E(\mathbf{w}) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{1}{2s^2} \mathbf{w}^{\mathrm{T}} \mathbf{w}. (2.117)


We see that this takes the form of the regularized sum-of-squares error function en- 
countered earlier in the form (1.4). 
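To see the effect of the prior variance s² on the solution, the following sketch minimizes (2.117) for a straight-line model y(x, w) = w_0 + w_1 x. Setting the gradient of (2.117) to zero gives a pair of linear equations with regularization coefficient σ²/s², which are solved here in closed form. The data values, the noise variance, and the function names are made up for illustration.

```python
def error(w, xs, ts, noise_var, prior_var):
    """E(w) from (2.117) for the model y(x, w) = w[0] + w[1] * x."""
    data_term = sum((w[0] + w[1] * x - t) ** 2 for x, t in zip(xs, ts)) / (2.0 * noise_var)
    prior_term = (w[0] ** 2 + w[1] ** 2) / (2.0 * prior_var)
    return data_term + prior_term

def map_weights(xs, ts, noise_var, prior_var):
    """Solve the 2x2 system obtained by setting the gradient of E(w) to zero.
    The regularization coefficient is lam = noise_var / prior_var."""
    lam = noise_var / prior_var
    a11 = len(xs) + lam
    a12 = sum(xs)
    a22 = sum(x * x for x in xs) + lam
    b1 = sum(ts)
    b2 = sum(x * t for x, t in zip(xs, ts))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ts = [0.1, 0.9, 2.1, 2.9, 4.2]  # made-up targets
noise_var = 0.25

w_weak = map_weights(xs, ts, noise_var, prior_var=100.0)   # broad prior, weak regularization
w_strong = map_weights(xs, ts, noise_var, prior_var=0.01)  # narrow prior, strong regularization

print("broad prior :", w_weak)
print("narrow prior:", w_strong)
```

A narrow prior pulls the weights towards zero, whereas a broad prior leaves the solution close to the maximum likelihood fit.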


2.6.3 Bayesian machine learning 


The Bayesian perspective has allowed us to motivate the use of regularization 
and to derive a specific form for the regularization term. However, the use of Bayes’ 
theorem alone does not constitute a truly Bayesian treatment of machine learning 
since it is still finding a single solution for w and does not therefore take account 
of uncertainty in the value of w. Suppose we have a training data set D and our 
goal is to predict some target variable t given a new input value x. We are therefore 
interested in the distribution of t given both x and D. From the sum and product 
rules of probability, we have 


p(t|x, \mathcal{D}) = \int p(t|x, \mathbf{w}) \, p(\mathbf{w}|\mathcal{D}) \, d\mathbf{w}. (2.118)


We see that the prediction is obtained by taking a weighted average of p(t|x, w) over all
possible values of w in which the weighting function is given by the posterior prob- 
ability distribution p(w|D). The key difference that distinguishes Bayesian methods 
is this integration over the space of parameters. By contrast, conventional frequentist 
methods use point estimates for parameters obtained by optimizing a loss function 
such as a regularized sum-of-squares. 
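The difference between integrating over parameters and using a point estimate can be illustrated with a one-parameter stand-in for (2.118). In the sketch below, w is the probability that a coin lands heads, the prior over w is assumed to be uniform, and the prediction for the next flip is obtained by averaging w under the posterior, with the integral approximated on a midpoint grid. These modelling choices are assumptions made for illustration and are not the regression setting of the text.

```python
def posterior_predictive_heads(heads, tails, num_cells=10000):
    """Approximate p(next = heads | D) = int w p(w|D) dw on a midpoint grid,
    assuming a uniform prior on w in [0, 1]."""
    dw = 1.0 / num_cells
    grid = [(i + 0.5) * dw for i in range(num_cells)]
    unnormalized = [w ** heads * (1.0 - w) ** tails for w in grid]  # likelihood times prior
    evidence = sum(unnormalized) * dw
    return sum(w * u for w, u in zip(grid, unnormalized)) * dw / evidence

def plug_in_heads(heads, tails):
    """Point-estimate alternative: predict using the maximum likelihood value of w."""
    return heads / (heads + tails)

bayes_3_0 = posterior_predictive_heads(3, 0)
plug_3_0 = plug_in_heads(3, 0)
bayes_8_2 = posterior_predictive_heads(8, 2)
plug_8_2 = plug_in_heads(8, 2)

print("3 heads, 0 tails: Bayesian", bayes_3_0, "plug-in", plug_3_0)
print("8 heads, 2 tails: Bayesian", bayes_8_2, "plug-in", plug_8_2)
```

With little data the averaged prediction is noticeably less extreme than the plug-in prediction, and the two approach one another as more observations are included.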

This fully Bayesian treatment of machine learning offers some powerful in- 
sights. For example, the problem of over-fitting, encountered earlier in the context 
of polynomial regression, is an example of a pathology arising from the use of max- 
imum likelihood, and does not arise when we marginalize over parameters using the 
Bayesian approach. Similarly, we may have multiple potential models that we could 
use to solve a given problem, such as polynomials of different orders in the regres- 
sion example. A maximum likelihood approach simply picks the model that gives 
the highest probability of the data, but this favours ever more complex models, lead- 
ing to over-fitting. A fully Bayesian treatment involves averaging over all possible 
models, with the contribution of each model weighted by its posterior probability. 
Moreover, this probability is typically highest for models of intermediate complexity. 
Very simple models (such as polynomials of low order) have low probability as they 
are unable to fit the data well, whereas very complex models (such as polynomials 
of very high order) also have low probability because the Bayesian integration over 
parameters automatically and elegantly penalizes complexity. For a comprehensive 
overview of Bayesian methods applied to machine learning, including neural net- 
works, see Bishop (2006). 

Unfortunately, there is a major drawback with the Bayesian framework, and 
this is apparent in (2.118), which involves integrating over the space of parameters. 
Modern deep learning models can have millions or billions of parameters and even 
simple approximations to such integrals are typically infeasible. In fact, given a
limited compute budget and an ample source of training data, it will often be better to
apply maximum likelihood techniques, generally augmented with one or more forms 
of regularization, to a large neural network rather than apply a Bayesian treatment to 
a much smaller model. 


Exercises


2.1 (⋆) In the cancer screening example, we used a prior probability of cancer of p(C =
1) = 0.01. In reality, the prevalence of cancer is generally very much lower. Con- 
sider a situation in which p(C = 1) = 0.001, and recompute the probability of 
having cancer given a positive test p(C = 1|T = 1). Intuitively, the result can ap- 
pear surprising to many people since the test seems to have high accuracy and yet a 
positive test still leads to a low probability of having cancer. 


2.2 (⋆⋆) Deterministic numbers satisfy the property of transitivity, so that if x > y and
y > z then it follows that x > z. When we go to random numbers, however, this 
property need no longer apply. Figure 2.16 shows a set of four cubical dice that have 
been arranged in a cyclic order. Show that each of the four dice has a 2/3 probability 
of rolling a higher number than the previous die in the cycle. Such dice are known 
as non-transitive dice, and the specific examples shown here are called Efron dice. 


Figure 2.16 An example of non-transitive cubical dice, in which each die
has been ‘flattened’ to reveal the numbers on each of the faces.
The dice have been arranged in a cycle, such that each die has a
2/3 probability of rolling a higher number than the previous die in
the cycle.


2.3 (⋆) Consider a variable y given by the sum of two independent random variables
y = u + v where u ~ p_u(u) and v ~ p_v(v). Show that the distribution p_y(y) is
given by


p_y(y) = \int p_u(u) \, p_v(y - u) \, du. (2.119)

This is known as the convolution of p_u(u) and p_v(v).


2.4 (⋆⋆) Verify that the uniform distribution (2.33) is correctly normalized, and find
expressions for its mean and variance. 


2.5 (⋆⋆) Verify that the exponential distribution (2.34) and the Laplace distribution (2.35)
are correctly normalized.


2.6 (⋆) Using the properties of the Dirac delta function, show that the empirical density
(2.37) is correctly normalized.


2.7 (⋆) By making use of the empirical density (2.37), show that the expectation given
by (2.39) can be approximated by a sum over a finite set of samples drawn from the 
density in the form (2.40). 


2.8 (⋆) Using the definition (2.44), show that var[f(x)] satisfies (2.45).


2.9 (⋆) Show that if two variables x and y are independent, then their covariance is zero.


2.10 (⋆) Suppose that the two variables x and z are statistically independent. Show that
the mean and variance of their sum satisfies


\mathbb{E}[x + z] = \mathbb{E}[x] + \mathbb{E}[z] (2.120)
\mathrm{var}[x + z] = \mathrm{var}[x] + \mathrm{var}[z]. (2.121)


2.11 (⋆) Consider two variables x and y with joint distribution p(x, y). Prove the following
two results:


\mathbb{E}[x] = \mathbb{E}_y\left[ \mathbb{E}_x[x|y] \right] (2.122)
\mathrm{var}[x] = \mathbb{E}_y\left[ \mathrm{var}_x[x|y] \right] + \mathrm{var}_y\left[ \mathbb{E}_x[x|y] \right]. (2.123)


Here E_x[x|y] denotes the expectation of x under the conditional distribution p(x|y),
with a similar notation for the conditional variance.


2.12 (⋆⋆⋆) In this exercise, we prove the normalization condition (2.51) for the univariate
Gaussian. To do this, consider the integral


I = \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2\sigma^2} x^2 \right) dx (2.124)

which we can evaluate by first writing its square in the form

I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} y^2 \right) dx \, dy. (2.125)


Now make the transformation from Cartesian coordinates (x, y) to polar coordinates
(r, θ) and then substitute u = r². Show that, by performing the integrals over θ and
u and then taking the square root of both sides, we obtain


I = (2\pi\sigma^2)^{1/2}. (2.126)


Finally, use this result to show that the Gaussian distribution N(x|μ, σ²) is normalized.


2.13 (⋆⋆) By using a change of variables, verify that the univariate Gaussian distribution
given by (2.49) satisfies (2.52). Next, by differentiating both sides of the normalization
condition

\int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2) \, dx = 1 (2.127)

with respect to σ², verify that the Gaussian satisfies (2.53). Finally, show that (2.54)
holds.


2.14 (⋆) Show that the mode (i.e., the maximum) of the Gaussian distribution (2.49) is
given by μ.


2.15 (⋆) By setting the derivatives of the log likelihood function (2.56) with respect to μ
and σ² equal to zero, verify the results (2.57) and (2.58).


2.16 (⋆⋆) Using the results (2.52) and (2.53), show that


\mathbb{E}[x_n x_m] = \mu^2 + I_{nm} \sigma^2 (2.128)


where x_n and x_m denote data points sampled from a Gaussian distribution with mean
μ and variance σ² and I_nm satisfies I_nm = 1 if n = m and I_nm = 0 otherwise.
Hence prove the results (2.59) and (2.60).


2.17 (⋆⋆) Using the definition (2.61), prove the result (2.62) which shows that the expectation
of the variance estimator for a Gaussian distribution based on the true mean is
given by the true variance σ².


2.18 (⋆) Show that maximizing (2.66) with respect to σ² gives the result (2.68).


2.19 (⋆⋆) Use the transformation property (2.71) of a probability density under a change
of variable to show that any density p(y) can be obtained from a fixed density q(x)
that is everywhere non-zero by making a nonlinear change of variable y = f(x) in
which f(x) is a monotonic function so that 0 ≤ f′(x) < ∞. Write down the differential
equation satisfied by f(x) and draw a diagram illustrating the transformation
of the density.


2.20 (⋆) Evaluate the elements of the Jacobian matrix for the transformation defined by
(2.78) and (2.79). 


2.21 (⋆) In Section 2.5, we introduced the idea of entropy h(x) as the information gained
on observing the value of a random variable x having distribution p(x). We saw
that, for independent variables x and y for which p(x, y) = p(x)p(y), the entropy
functions are additive, so that h(x, y) = h(x) + h(y). In this exercise, we derive the
relation between h and p in the form of a function h(p). First show that h(p²) =
2h(p) and, hence, by induction that h(p^n) = n h(p) where n is a positive integer.
Hence, show that h(p^(n/m)) = (n/m) h(p) where m is also a positive integer. This
implies that h(p^x) = x h(p) where x is a positive rational number and, hence, by
continuity when it is a positive real number. Finally, show that this implies h(p)
must take the form h(p) ∝ ln p.


2.22 (⋆) Use a Lagrange multiplier to show that maximization of the entropy (2.86) for a
discrete variable gives a distribution in which all of the probabilities p(x_i) are equal
and that the corresponding value of the entropy is then ln M.


2.23 (⋆) Consider an M-state discrete random variable x, and use Jensen’s inequality in
the form (2.102) to show that the entropy of its distribution p(x) satisfies H[x] ≤
ln M.


2.24 (⋆⋆) Use the calculus of variations to show that the stationary point of the functional
(2.96) is given by (2.97). Then use the constraints (2.93), (2.94), and (2.95) to elim- 
inate the Lagrange multipliers and, hence, show that the maximum entropy solution 
is given by the Gaussian (2.98). 


2.25 (⋆) Use the results (2.94) and (2.95) to show that the entropy of the univariate Gaussian
(2.98) is given by (2.99).


2.26 (⋆⋆) Suppose that p(x) is some fixed distribution and that we wish to approximate it
using a Gaussian distribution q(x) = N(x|μ, Σ). By writing down the form of the
Kullback–Leibler divergence KL(p‖q) for a Gaussian q(x) and then differentiating,
show that minimization of KL(p‖q) with respect to μ and Σ leads to the result that
μ is given by the expectation of x under p(x) and that Σ is given by the covariance.


2.27 (⋆⋆) Evaluate the Kullback–Leibler divergence (2.100) between the two Gaussians
p(x) = N(x|μ, σ²) and q(x) = N(x|m, s²).


2.28 (⋆⋆) The alpha family of divergences is defined by


D_\alpha(p \| q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{(1+\alpha)/2} q(x)^{(1-\alpha)/2} \, dx \right) (2.129)


where −∞ < α < ∞ is a continuous parameter. Show that the Kullback–Leibler
divergence KL(p‖q) corresponds to α → 1. This can be done by writing p^ε =
exp(ε ln p) = 1 + ε ln p + O(ε²) and then taking ε → 0. Similarly, show that
KL(q‖p) corresponds to α → −1.


2.29 (⋆⋆) Consider two variables x and y having joint distribution p(x, y). Show that the
differential entropy of this pair of variables satisfies


H[\mathbf{x}, \mathbf{y}] \leq H[\mathbf{x}] + H[\mathbf{y}] (2.130)

with equality if, and only if, x and y are statistically independent.


2.30 (⋆) Consider a vector x of continuous variables with distribution p(x) and corresponding
entropy H[x]. Suppose that we make a non-singular linear transformation
of x to obtain a new variable y = Ax. Show that the corresponding entropy is given
by H[y] = H[x] + ln |det A| where det A denotes the determinant of A.


2.31 (⋆⋆) Suppose that the conditional entropy H[y|x] between two discrete random variables
x and y is zero. Show that, for all values of x such that p(x) > 0, the variable
y must be a function of x. In other words, for each x there is only one value of y
such that p(y|x) ≠ 0.


2.32 (⋆) A strictly convex function is defined as one for which every chord lies above the
function. Show that this is equivalent to the condition that the second derivative of
the function is positive.


2.33 (⋆⋆) Using proof by induction, show that the inequality (2.101) for convex functions
implies the result (2.102). 


2.34 (⋆) Show that, up to an additive constant, the Kullback–Leibler divergence (2.100)
between the empirical distribution (2.37) and a model distribution q(x|θ) is equal to
the negative log likelihood function.


2.35 (⋆) Using the definition (2.107) together with the product rule of probability, prove
the result (2.108). 


2.36 (⋆⋆⋆) Consider two binary variables x and y having the joint distribution given by


Evaluate the following quantities: 


(a) H[x] (c) H[y|x] (e) H[x, y]
(b) H[y] (d) H[x|y] (f) I[x, y].


Draw a Venn diagram to show the relationship between these various quantities. 


2.37 (⋆) By applying Jensen’s inequality (2.102) with f(x) = ln x, show that the arithmetic
mean of a set of real numbers is never less than their geometrical mean.


2.38 (⋆) Using the sum and product rules of probability, show that the mutual information
I[x, y] satisfies the relation (2.110).


2.39 (⋆⋆) Suppose that two variables z_1 and z_2 are independent so that p(z_1, z_2) =
p(z_1)p(z_2). Show that the covariance matrix between these variables is diagonal.
This shows that independence is a sufficient condition for two variables to be uncorrelated.
Now consider two variables y_1 and y_2 where y_1 is symmetrically distributed
around 0 and y_2 = y_1². Write down the conditional distribution p(y_2|y_1) and observe
that this is dependent on y_1, thus showing that the two variables are not independent.
Now show that the covariance matrix between these two variables is again diagonal.
To do this, use the relation p(y_1, y_2) = p(y_1)p(y_2|y_1) to show that the off-diagonal
terms are zero. This counterexample shows that zero correlation is not a sufficient
condition for independence.


2.40 (⋆) Consider the bent coin in Figure 2.2. Assume that the prior probability that the
convex side is heads is 0.1. Now suppose the coin is flipped 10 times and we are 
told that eight of the flips landed heads up and two of the flips landed tails up. Use 
Bayes’ theorem to evaluate the posterior probability that the concave side is heads. 
Calculate the probability that the next flip will land heads up.


2.41 (⋆) By substituting (2.115) into (2.114) and making use of the result (2.66) for the log
likelihood of the linear regression model, derive the result (2.117) for the regularized
error function.


Standard Distributions


In this chapter we discuss some specific examples of probability distributions and
their properties. As well as being of interest in their own right, these distributions
can form building blocks for more complex models and will be used extensively
throughout the book.

One role for the distributions discussed in this chapter is to model the probability
distribution p(x) of a random variable x, given a finite set x_1, ..., x_N of
observations. This problem is known as density estimation. It should be emphasized
that the problem of density estimation is fundamentally ill-posed, because there are
infinitely many probability distributions that could have given rise to the observed
finite data set. Indeed, any distribution p(x) that is non-zero at each of the data points
x_1, ..., x_N is a potential candidate. The issue of choosing an appropriate distribution
relates to the problem of model selection, which has already been encountered
in the context of polynomial curve fitting (Section 1.2) and which is a central issue in machine
learning.

We begin by considering distributions for discrete variables before exploring 
the Gaussian distribution for continuous variables. These are specific examples of 
parametric distributions, so called because they are governed by a relatively small 
number of adjustable parameters, such as the mean and variance of a Gaussian. To 
apply such models to the problem of density estimation, we need a procedure for 
determining suitable values for the parameters, given an observed data set, and our 
main focus will be on maximizing the likelihood function. In this chapter, we will 
assume that the data observations are independent and identically distributed (i.i.d.),
whereas in future chapters we will explore more complex scenarios involving struc- 
tured data where this assumption no longer holds. 

One limitation of the parametric approach is that it assumes a specific functional 
form for the distribution, which may turn out to be inappropriate for a particular 
application. An alternative approach is given by nonparametric density estimation 
methods in which the form of the distribution typically depends on the size of the data 
set. Such models still contain parameters, but these control the model complexity 
rather than the form of the distribution. We end this chapter by briefly considering 
three nonparametric methods based respectively on histograms, nearest neighbours, 
and kernels. A major limitation of nonparametric techniques such as these is that 
they involve storing all the training data. In other words, the number of parameters 
grows with the size of the data set, so that the method becomes very inefficient for
large data sets. Deep learning combines the efficiency of parametric models with the 
generality of nonparametric methods by considering flexible distributions based on 
neural networks having a large, but fixed, number of parameters. 


3.1 Discrete Variables


We begin by considering simple distributions for discrete variables, starting with 
binary variables and then moving on to multi-state variables. 
3.1.1 Bernoulli distribution 


Consider a single binary random variable $x \in \{0, 1\}$. For example, $x$ might
describe the outcome of flipping a coin, with $x = 1$ representing ‘heads’ and $x = 0$
representing ‘tails’. If this were a damaged coin, such as the one shown in Figure 2.2,
the probability of landing heads is not necessarily the same as that of landing tails.
The probability of $x = 1$ will be denoted by the parameter $\mu$ so that

$$ p(x = 1 | \mu) = \mu \tag{3.1} $$

where $0 \leqslant \mu \leqslant 1$, from which it follows that $p(x = 0 | \mu) = 1 - \mu$. The probability
distribution over $x$ can therefore be written in the form

$$ \mathrm{Bern}(x | \mu) = \mu^{x} (1 - \mu)^{1 - x}, \tag{3.2} $$


which is known as the Bernoulli distribution. It is easily verified that this distribution
is normalized and that it has mean and variance given by

$$ \mathbb{E}[x] = \mu \tag{3.3} $$
$$ \mathrm{var}[x] = \mu(1 - \mu). \tag{3.4} $$

Now suppose we have a data set $\mathcal{D} = \{x_1, \ldots, x_N\}$ of observed values of $x$.
We can construct the likelihood function, which is a function of $\mu$, on the assumption
that the observations are drawn independently from $p(x | \mu)$, so that

$$ p(\mathcal{D} | \mu) = \prod_{n=1}^{N} p(x_n | \mu) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n}. \tag{3.5} $$


We can estimate a value for $\mu$ by maximizing the likelihood function or equivalently
by maximizing the logarithm of the likelihood, since the log is a monotonic function.
The log likelihood function of the Bernoulli distribution is given by

$$ \ln p(\mathcal{D} | \mu) = \sum_{n=1}^{N} \ln p(x_n | \mu) = \sum_{n=1}^{N} \left\{ x_n \ln \mu + (1 - x_n) \ln(1 - \mu) \right\}. \tag{3.6} $$

At this point, note that the log likelihood function depends on the $N$ observations $x_n$
only through their sum $\sum_n x_n$. This sum provides an example of a sufficient statistic
for the data under this distribution. If we set the derivative of $\ln p(\mathcal{D} | \mu)$ with respect
to $\mu$ equal to zero, we obtain the maximum likelihood estimator:

$$ \mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \tag{3.7} $$

which is also known as the sample mean. Denoting the number of observations of
$x = 1$ (heads) within this data set by $m$, we can write (3.7) in the form

$$ \mu_{\mathrm{ML}} = \frac{m}{N} \tag{3.8} $$


so that the probability of landing heads is given, in this maximum likelihood frame- 
work, by the fraction of observations of heads in the data set. 
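
The maximum likelihood result can be checked numerically. The following is a minimal sketch, not part of the original text: the data set and the function names `bernoulli_log_likelihood` and `bernoulli_mle` are hypothetical, and the grid search is included only to confirm that the sample mean maximizes the log likelihood (3.6).

```python
import math

def bernoulli_log_likelihood(data, mu):
    """Log likelihood (3.6) of binary observations under Bern(x | mu)."""
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in data)

def bernoulli_mle(data):
    """Maximum likelihood estimate (3.7): the sample mean, i.e. m / N as in (3.8)."""
    return sum(data) / len(data)

# Hypothetical coin-flip data: 1 = heads, 0 = tails (m = 5 heads out of N = 8).
data = [1, 0, 0, 1, 1, 0, 1, 1]
mu_ml = bernoulli_mle(data)

# Coarse grid search over mu in (0, 1) as an independent check.
grid = [i / 1000 for i in range(1, 1000)]
mu_grid = max(grid, key=lambda mu: bernoulli_log_likelihood(data, mu))

print(mu_ml, mu_grid)
```

With only a handful of observations the estimate can be extreme (for example, three heads in three flips gives an estimate of 1), which is one reason maximum likelihood can over-fit on small data sets.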


3.1.2 Binomial distribution 


We can also work out the distribution of the number $m$ of observations of $x = 1$,
given that the data set has size $N$. This is called the
binomial distribution, and from (3.5) we see that it is proportional to $\mu^{m} (1 - \mu)^{N - m}$.
To obtain the normalization coefficient, note that out of $N$ coin flips, we have to add
up all of the possible ways of obtaining $m$ heads, so that the binomial distribution
can be written as

$$ \mathrm{Bin}(m | N, \mu) = \binom{N}{m} \mu^{m} (1 - \mu)^{N - m} \tag{3.9} $$

where

$$ \binom{N}{m} \equiv \frac{N!}{(N - m)! \, m!} \tag{3.10} $$

is the number of ways of choosing $m$ objects out of a total of $N$ identical objects
without replacement. Figure 3.1 shows a plot of the binomial distribution for $N = 10$
and $\mu = 0.25$.

Figure 3.1 Histogram plot of the binomial distribution (3.9) as a function of
$m$ for $N = 10$ and $\mu = 0.25$.

The mean and variance of the binomial distribution can be found by using the 
results that, for independent events, the mean of the sum is the sum of the means and 
the variance of the sum is the sum of the variances. Because $m = x_1 + \ldots + x_N$
and because for each observation the mean and variance are given by (3.3) and (3.4),
respectively, we have

$$ \mathbb{E}[m] \equiv \sum_{m=0}^{N} m \, \mathrm{Bin}(m | N, \mu) = N\mu \tag{3.11} $$
$$ \mathrm{var}[m] \equiv \sum_{m=0}^{N} (m - \mathbb{E}[m])^2 \, \mathrm{Bin}(m | N, \mu) = N\mu(1 - \mu). \tag{3.12} $$


These results can also be proved directly by using calculus. 
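
As a quick numerical check of (3.9), (3.11), and (3.12), the sketch below evaluates the binomial distribution for the values used in Figure 3.1 ($N = 10$, $\mu = 0.25$) and sums over $m$ directly. The function name `binomial_pmf` is an illustrative choice rather than anything defined in the text.

```python
from math import comb

def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) from (3.9), using the binomial coefficient (3.10)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

total = sum(pmf)                                          # normalization
mean = sum(m * p for m, p in enumerate(pmf))              # should equal N * mu
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))   # should equal N * mu * (1 - mu)

print(total, mean, var)
```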


3.1.3 Multinomial distribution 


Binary variables can be used to describe quantities that can take one of two 
possible values. Often, however, we encounter discrete variables that can take on 
one of K possible mutually exclusive states. Although there are various alternative 
ways to express such variables, we will see shortly that a particularly convenient 
representation is the 1-of-K scheme, sometimes called ‘one-hot encoding’, in which 
the variable is represented by a $K$-dimensional vector $\mathbf{x}$ in which one of the elements
$x_k$ equals 1 and all remaining elements equal 0. So, for instance, if we have a variable
that can take $K = 6$ states and a particular observation of the variable happens to
correspond to the state where $x_3 = 1$, then $\mathbf{x}$ will be represented by

$$ \mathbf{x} = (0, 0, 1, 0, 0, 0)^{\mathrm{T}}. \tag{3.13} $$

Note that such vectors satisfy $\sum_{k} x_k = 1$. If we denote the probability of $x_k = 1$
by the parameter $\mu_k$, then the distribution of $\mathbf{x}$ is given by

$$ p(\mathbf{x} | \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} \tag{3.14} $$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{\mathrm{T}}$, and the parameters $\mu_k$ are constrained to satisfy $\mu_k \geqslant 0$
and $\sum_k \mu_k = 1$, because they represent probabilities. The distribution (3.14) can be
regarded as a generalization of the Bernoulli distribution to more than two outcomes.
It is easily seen that the distribution is normalized:

$$ \sum_{\mathbf{x}} p(\mathbf{x} | \boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1 \tag{3.15} $$

and that

$$ \mathbb{E}[\mathbf{x} | \boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x} | \boldsymbol{\mu}) \, \mathbf{x} = \boldsymbol{\mu}. \tag{3.16} $$

Now consider a data set $\mathcal{D}$ of $N$ independent observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$. The
corresponding likelihood function takes the form

$$ p(\mathcal{D} | \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\left(\sum_n x_{nk}\right)} = \prod_{k=1}^{K} \mu_k^{m_k} \tag{3.17} $$

where we see that the likelihood function depends on the $N$ data points only through
the $K$ quantities:

$$ m_k = \sum_{n=1}^{N} x_{nk}, \tag{3.18} $$

which represent the number of observations of $x_k = 1$. These are called the sufficient
statistics for this distribution. Note that the variables $m_k$ are subject to the constraint

$$ \sum_{k=1}^{K} m_k = N. \tag{3.19} $$
k=1 


To find the maximum likelihood solution for $\boldsymbol{\mu}$, we need to maximize $\ln p(\mathcal{D} | \boldsymbol{\mu})$
with respect to $\mu_k$, taking account of the constraint (3.15) that the $\mu_k$ must sum to
one. This can be achieved using a Lagrange multiplier $\lambda$ and maximizing

$$ \sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right). \tag{3.20} $$

Setting the derivative of (3.20) with respect to $\mu_k$ to zero, we obtain

$$ \mu_k = -m_k / \lambda. \tag{3.21} $$

We can solve for the Lagrange multiplier $\lambda$ by substituting (3.21) into the constraint
$\sum_k \mu_k = 1$ to give $\lambda = -N$. Thus, we obtain the maximum likelihood solution for
$\mu_k$ in the form

$$ \mu_k^{\mathrm{ML}} = \frac{m_k}{N}, \tag{3.22} $$

which is the fraction of the $N$ observations for which $x_k = 1$.
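
The 1-of-K representation and the resulting estimator can be illustrated with a short sketch. The observed states below are made up, the states are indexed from zero for convenience, and `one_hot` and `multinomial_mle` are hypothetical helper names. The code computes the sufficient statistics (3.18) and then the maximum likelihood solution (3.22).

```python
def one_hot(state, K):
    """1-of-K encoding of a state index in {0, ..., K-1}."""
    return [1 if k == state else 0 for k in range(K)]

def multinomial_mle(X):
    """Return the counts m_k (3.18) and the estimates mu_k = m_k / N (3.22)."""
    N, K = len(X), len(X[0])
    m = [sum(x[k] for x in X) for k in range(K)]
    return m, [m_k / N for m_k in m]

# Hypothetical observations of a variable with K = 3 states.
states = [2, 0, 2, 1, 2, 0]
X = [one_hot(s, 3) for s in states]

m, mu_ml = multinomial_mle(X)
print(m, mu_ml)
```

The counts sum to $N$, as required by (3.19), and the estimates sum to one, as required by the constraint on the $\mu_k$.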
We can also consider the joint distribution of the quantities $m_1, \ldots, m_K$, conditioned
on the parameter vector $\boldsymbol{\mu}$ and on the total number $N$ of observations. From
(3.17), this takes the form

$$ \mathrm{Mult}(m_1, m_2, \ldots, m_K | \boldsymbol{\mu}, N) = \binom{N}{m_1 m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}, \tag{3.23} $$

which is known as the multinomial distribution. The normalization coefficient is the
number of ways of partitioning $N$ objects into $K$ groups of size $m_1, \ldots, m_K$ and is
given by

$$ \binom{N}{m_1 m_2 \ldots m_K} = \frac{N!}{m_1! \, m_2! \ldots m_K!}. \tag{3.24} $$


Note that two-state quantities can be represented either as binary variables and 
modelled using the binomial distribution (3.9) or as 1-of-2 variables and modelled 
using the distribution (3.14) with K = 2. 


3.2 The Multivariate Gaussian


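The Gaussian, also known as the normal distribution, is a widely used model for
the distribution of continuous variables. We have already seen that for a single
variable $x$, the Gaussian distribution can be written in the form

$$ \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\} \tag{3.25} $$

where $\mu$ is the mean and $\sigma^2$ is the variance. For a $D$-dimensional vector $\mathbf{x}$, the
multivariate Gaussian distribution takes the form

$$ \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{(\det \boldsymbol{\Sigma})^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \tag{3.26} $$

where $\boldsymbol{\mu}$ is the $D$-dimensional mean vector, $\boldsymbol{\Sigma}$ is the $D \times D$ covariance matrix, and
$\det \boldsymbol{\Sigma}$ denotes the determinant of $\boldsymbol{\Sigma}$.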

The Gaussian distribution arises in many different contexts and can be motivated 
from a variety of different perspectives. For example, we have already seen that for
a single real variable, the distribution that maximizes the entropy is the Gaussian.
This property applies also to the multivariate Gaussian.

Figure 3.2 Histogram plots of the mean of $N$ uniformly distributed numbers for various values of $N$. We
observe that as $N$ increases, the distribution tends towards a Gaussian.

Another situation in which the Gaussian distribution arises is when we consider 
the sum of multiple random variables. The central limit theorem tells us that, subject 
to certain mild conditions, the sum of a set of random variables, which is of course 
itself a random variable, has a distribution that becomes increasingly Gaussian as the 
number of terms in the sum increases (Walker, 1969). We can illustrate this by considering
$N$ variables $x_1, \ldots, x_N$ each of which has a uniform distribution over the
interval $[0, 1]$ and then considering the distribution of the mean $(x_1 + \cdots + x_N)/N$.
For large $N$, this distribution tends to a Gaussian, as illustrated in Figure 3.2. In
practice, the convergence to a Gaussian as $N$ increases can be very rapid. One consequence
of this result is that the binomial distribution (3.9), which is a distribution
over $m$ defined by the sum of $N$ observations of the random binary variable $x$, will
tend to a Gaussian as $N \to \infty$ (see Figure 3.1 for $N = 10$).
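
The experiment behind Figure 3.2 can be imitated with a small simulation. This is only a sketch under stated assumptions: the sample count, the seed, and the helper names are arbitrary choices, and the fraction of samples within one standard deviation of the mean is used as a crude indicator of Gaussian shape (roughly 0.683 for a Gaussian, compared with about 0.577 for a single uniform variable).

```python
import random
import statistics

def sample_means(N, num_samples, rng):
    """Draw num_samples means, each over N independent Uniform[0, 1] variables."""
    return [sum(rng.random() for _ in range(N)) / N for _ in range(num_samples)]

def fraction_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    centre = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - centre) <= sd for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
summary = {}
for N in (1, 2, 10):
    means = sample_means(N, 20000, rng)
    summary[N] = (statistics.fmean(means),
                  statistics.pvariance(means),
                  fraction_within_one_sd(means))
    print(N, summary[N])
```

The mean stays near 0.5 while the variance shrinks roughly as $1/(12N)$, since each uniform variable has variance $1/12$ and the variables are independent.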

The Gaussian distribution has many important analytical properties, and we will 
consider several of these in detail. As a result, this section will be rather more tech- 
nically involved than some of the earlier sections and will require familiarity with 
various matrix identities. 


3.2.1 Geometry of the Gaussian 


We begin by considering the geometrical form of the Gaussian distribution. The 
functional dependence of the Gaussian on x is through the quadratic form 


$$ \Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}), \tag{3.27} $$

which appears in the exponent. The quantity $\Delta$ is called the Mahalanobis distance
from $\boldsymbol{\mu}$ to $\mathbf{x}$. It reduces to the Euclidean distance when $\boldsymbol{\Sigma}$ is the identity matrix.
The Gaussian distribution is constant on surfaces in $\mathbf{x}$-space for which this quadratic
form is constant.

First, note that the matrix $\boldsymbol{\Sigma}$ can be taken to be symmetric, without loss of generality,
because any antisymmetric component would disappear from the exponent.
Now consider the eigenvector equation for the covariance matrix

$$ \boldsymbol{\Sigma} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{3.28} $$
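where $i = 1, \ldots, D$. Because $\boldsymbol{\Sigma}$ is a real, symmetric matrix, its eigenvalues will be
real, and its eigenvectors can be chosen to form an orthonormal set, so that

$$ \mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = I_{ij} \tag{3.29} $$

where $I_{ij}$ is the $i, j$ element of the identity matrix and satisfies

$$ I_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases} \tag{3.30} $$

The covariance matrix $\boldsymbol{\Sigma}$ can be expressed as an expansion in terms of its eigenvectors
in the form

$$ \boldsymbol{\Sigma} = \sum_{i=1}^{D} \lambda_i \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}} \tag{3.31} $$

and similarly the inverse covariance matrix $\boldsymbol{\Sigma}^{-1}$ can be expressed as

$$ \boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}}. \tag{3.32} $$

Substituting (3.32) into (3.27), the quadratic form becomes

$$ \Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i} \tag{3.33} $$

where we have defined

$$ y_i = \mathbf{u}_i^{\mathrm{T}} (\mathbf{x} - \boldsymbol{\mu}). \tag{3.34} $$

We can interpret $\{y_i\}$ as a new coordinate system defined by the orthonormal vectors
$\mathbf{u}_i$ that are shifted and rotated with respect to the original $x_i$ coordinates. Forming
the vector $\mathbf{y} = (y_1, \ldots, y_D)^{\mathrm{T}}$, we have

$$ \mathbf{y} = \mathbf{U}(\mathbf{x} - \boldsymbol{\mu}) \tag{3.35} $$

where $\mathbf{U}$ is a matrix whose rows are given by $\mathbf{u}_i^{\mathrm{T}}$. From (3.29) it follows that $\mathbf{U}$ is
an orthogonal matrix, i.e., it satisfies $\mathbf{U}\mathbf{U}^{\mathrm{T}} = \mathbf{U}^{\mathrm{T}}\mathbf{U} = \mathbf{I}$, where $\mathbf{I}$ is the identity
matrix.

The quadratic form, and hence the Gaussian density, is constant on surfaces for
which (3.33) is constant. If all the eigenvalues $\lambda_i$ are positive, then these surfaces
represent ellipsoids, with their centres at $\boldsymbol{\mu}$ and their axes oriented along $\mathbf{u}_i$, and with
scaling factors in the directions of the axes given by $\lambda_i^{1/2}$, as illustrated in Figure 3.3.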
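
The change of coordinates can be verified numerically. The sketch below assumes NumPy is available and uses a made-up two-dimensional mean, covariance, and test point. It evaluates the Mahalanobis distance directly from (3.27) and again from the eigenvector form (3.33), and rebuilds the covariance from (3.31).

```python
import numpy as np

# Hypothetical parameters: any symmetric positive-definite Sigma would do.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([2.5, 0.0])

# Direct evaluation of the quadratic form (3.27).
diff = x - mu
delta_sq_direct = diff @ np.linalg.solve(Sigma, diff)

# Eigendecomposition (3.28). np.linalg.eigh returns orthonormal eigenvectors
# as columns, so the matrix U of (3.35), whose rows are u_i^T, is the transpose.
lam, vecs = np.linalg.eigh(Sigma)
U = vecs.T
y = U @ diff                        # shifted and rotated coordinates (3.35)
delta_sq_eig = np.sum(y**2 / lam)   # quadratic form in the new coordinates (3.33)

# Rebuild the covariance from its eigenvector expansion (3.31).
Sigma_rebuilt = sum(l * np.outer(u, u) for l, u in zip(lam, U))

print(delta_sq_direct, delta_sq_eig)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical preference, not something the derivation requires.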

For the Gaussian distribution to be well defined, it is necessary for all the eigenvalues
$\lambda_i$ of the covariance matrix to be strictly positive, otherwise the distribution
cannot be properly normalized. A matrix whose eigenvalues are strictly positive is
said to be positive definite. When we discuss latent variable models, we will encounter
Gaussian distributions for which one or more of the eigenvalues are zero, in
which case the distribution is singular and is confined to a subspace of lower dimensionality.
If all the eigenvalues are non-negative, then the covariance matrix is said
to be positive semidefinite.

Figure 3.3 The red curve shows the elliptical surface of constant probability density for a Gaussian in
a two-dimensional space $\mathbf{x} = (x_1, x_2)$ on which the density is $\exp(-1/2)$ of its value at $\mathbf{x} = \boldsymbol{\mu}$.
The axes of the ellipse are defined by the eigenvectors $\mathbf{u}_i$ of the covariance matrix, with
corresponding eigenvalues $\lambda_i$.

Now consider the form of the Gaussian distribution in the new coordinate system
defined by the $y_i$. In going from the $\mathbf{x}$ to the $\mathbf{y}$ coordinate system, we have a Jacobian
matrix $\mathbf{J}$ with elements given by

$$ J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ji} \tag{3.36} $$

where $U_{ji}$ are the elements of the matrix $\mathbf{U}^{\mathrm{T}}$. Using the orthonormality property of
the matrix $\mathbf{U}$, we see that the square of the determinant of the Jacobian matrix is

$$ |\mathbf{J}|^2 = |\mathbf{U}^{\mathrm{T}}|^2 = |\mathbf{U}^{\mathrm{T}}| \, |\mathbf{U}| = |\mathbf{U}^{\mathrm{T}} \mathbf{U}| = |\mathbf{I}| = 1 \tag{3.37} $$

and, hence, $|\mathbf{J}| = 1$. Also, the determinant $|\boldsymbol{\Sigma}|$ of the covariance matrix can be
written as the product of its eigenvalues, and hence

$$ |\boldsymbol{\Sigma}|^{1/2} = \prod_{j=1}^{D} \lambda_j^{1/2}. \tag{3.38} $$

Thus, in the $y_j$ coordinate system, the Gaussian distribution takes the form

$$ p(\mathbf{y}) = p(\mathbf{x}) |\mathbf{J}| = \prod_{j=1}^{D} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left\{ -\frac{y_j^2}{2\lambda_j} \right\}, \tag{3.39} $$

which is the product of $D$ independent univariate Gaussian distributions. The eigenvectors
therefore define a new set of shifted and rotated coordinates with respect
to which the joint probability distribution factorizes into a product of independent
distributions. The integral of the distribution in the $\mathbf{y}$ coordinate system is then

$$ \int p(\mathbf{y}) \, \mathrm{d}\mathbf{y} = \prod_{j=1}^{D} \int_{-\infty}^{\infty} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\left\{ -\frac{y_j^2}{2\lambda_j} \right\} \mathrm{d}y_j = 1 \tag{3.40} $$

where we have used the result (2.51) for the normalization of the univariate Gaussian.
This confirms that the multivariate Gaussian (3.26) is indeed normalized.


3.2.2 Moments 


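We now look at the moments of the Gaussian distribution and thereby provide an
interpretation of the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. The expectation of $\mathbf{x}$ under the Gaussian
distribution is given by

$$ \begin{aligned} \mathbb{E}[\mathbf{x}] &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \int \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \mathbf{x} \, \mathrm{d}\mathbf{x} \\ &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \int \exp\left\{ -\frac{1}{2} \mathbf{z}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \mathbf{z} \right\} (\mathbf{z} + \boldsymbol{\mu}) \, \mathrm{d}\mathbf{z} \end{aligned} \tag{3.41} $$

where we have changed variables using $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}$. Note that the exponent is an even
function of the components of $\mathbf{z}$, and because the integrals over these are taken over
the range $(-\infty, \infty)$, the term in $\mathbf{z}$ in the factor $(\mathbf{z} + \boldsymbol{\mu})$ will vanish by symmetry.
Thus,

$$ \mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}, \tag{3.42} $$

and so we refer to $\boldsymbol{\mu}$ as the mean of the Gaussian distribution.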

We now consider second-order moments of the Gaussian. In the univariate case, 
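we considered the second-order moment given by $\mathbb{E}[x^2]$. For the multivariate Gaussian,
there are $D^2$ second-order moments given by $\mathbb{E}[x_i x_j]$, which we can group
together to form the matrix $\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}]$. This matrix can be written as

$$ \begin{aligned} \mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}] &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \int \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \mathbf{x}\mathbf{x}^{\mathrm{T}} \, \mathrm{d}\mathbf{x} \\ &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \int \exp\left\{ -\frac{1}{2} \mathbf{z}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \mathbf{z} \right\} (\mathbf{z} + \boldsymbol{\mu})(\mathbf{z} + \boldsymbol{\mu})^{\mathrm{T}} \, \mathrm{d}\mathbf{z} \end{aligned} \tag{3.43} $$

where again we have changed variables using $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}$. Note that the cross-terms
involving $\boldsymbol{\mu}\mathbf{z}^{\mathrm{T}}$ and $\mathbf{z}\boldsymbol{\mu}^{\mathrm{T}}$ will again vanish by symmetry. The term $\boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}}$ is constant
and can be taken outside the integral, which itself is unity because the Gaussian
distribution is normalized. Consider the term involving $\mathbf{z}\mathbf{z}^{\mathrm{T}}$. Again, we can make
use of the eigenvector expansion of the covariance matrix given by (3.28), together
with the completeness of the set of eigenvectors, to write

$$ \mathbf{z} = \sum_{j=1}^{D} y_j \mathbf{u}_j \tag{3.44} $$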
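where $y_j = \mathbf{u}_j^{\mathrm{T}} \mathbf{z}$, which gives

$$ \begin{aligned} & \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \int \exp\left\{ -\frac{1}{2} \mathbf{z}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \mathbf{z} \right\} \mathbf{z}\mathbf{z}^{\mathrm{T}} \, \mathrm{d}\mathbf{z} \\ &\quad = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \sum_{i=1}^{D} \sum_{j=1}^{D} \mathbf{u}_i \mathbf{u}_j^{\mathrm{T}} \int \exp\left\{ -\sum_{k=1}^{D} \frac{y_k^2}{2\lambda_k} \right\} y_i y_j \, \mathrm{d}\mathbf{y} \\ &\quad = \sum_{i=1}^{D} \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}} \lambda_i = \boldsymbol{\Sigma} \end{aligned} \tag{3.45} $$

where we have made use of the eigenvector equation (3.28), together with the fact
that the integral on the middle line vanishes by symmetry unless $i = j$. In the final
line we have made use of the results (2.53) and (3.38), together with (3.31). Thus,
we have

$$ \mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}] = \boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}} + \boldsymbol{\Sigma}. \tag{3.46} $$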


When defining the variance for a single random variable, we subtracted the mean 
before taking the second moment. Similarly, in the multivariate case it is again 
convenient to subtract off the mean, giving rise to the covariance of a random vector 
x defined by 


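$$ \mathrm{cov}[\mathbf{x}] = \mathbb{E}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}]) (\mathbf{x} - \mathbb{E}[\mathbf{x}])^{\mathrm{T}} \right]. \tag{3.47} $$

For the specific case of a Gaussian distribution, we can make use of $\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$,
together with the result (3.46), to give

$$ \mathrm{cov}[\mathbf{x}] = \boldsymbol{\Sigma}. \tag{3.48} $$

Because the parameter matrix $\boldsymbol{\Sigma}$ governs the covariance of $\mathbf{x}$ under the Gaussian
distribution, it is called the covariance matrix.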


3.2.3 Limitations 


Although the Gaussian distribution (3.26) is often used as a simple density
model, it suffers from some significant limitations. Consider the number of free
parameters in the distribution. A general symmetric covariance matrix $\boldsymbol{\Sigma}$ will have
$D(D + 1)/2$ independent parameters, and there are another $D$ independent parameters
in $\boldsymbol{\mu}$, giving $D(D + 3)/2$ parameters in total. For large $D$, the total number of
parameters therefore grows quadratically with $D$, and the computational task of manipulating
and inverting the large matrices can become prohibitive. One way to address
this problem is to use restricted forms of the covariance matrix. If we consider
covariance matrices that are diagonal, so that $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_i^2)$, we then have a total
of $2D$ independent parameters in the density model. The corresponding contours of
constant density are given by axis-aligned ellipsoids. We could further restrict the
covariance matrix to be proportional to the identity matrix, $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$, known as an
isotropic covariance, giving $D + 1$ independent parameters in the model together
with spherical surfaces of constant density. The three possibilities of general, diagonal,
and isotropic covariance matrices are illustrated in Figure 3.4.

Figure 3.4 Contours of constant probability density for a Gaussian distribution in two dimensions in
which the covariance matrix is (a) of general form, (b) diagonal, in which case the elliptical contours
are aligned with the coordinate axes, and (c) proportional to the identity matrix, in which case the
contours are concentric circles.

Unfortunately,
whereas such approaches limit the number of degrees of freedom in the distribu- 
tion and make inversion of the covariance matrix a much faster operation, they also 
greatly restrict the form of the probability density and limit its ability to capture 
interesting correlations in the data. 
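
The parameter counts quoted above are easy to tabulate. The following sketch uses a hypothetical helper, `gaussian_parameter_count`, whose name and string labels are illustrative only. It returns the number of independent parameters in the mean plus the covariance for each of the three covariance structures.

```python
def gaussian_parameter_count(D, covariance="general"):
    """Independent parameters of a D-dimensional Gaussian (mean plus covariance)."""
    if covariance == "general":
        cov_params = D * (D + 1) // 2   # symmetric D x D matrix
    elif covariance == "diagonal":
        cov_params = D                  # one variance per dimension
    elif covariance == "isotropic":
        cov_params = 1                  # a single shared variance
    else:
        raise ValueError(f"unknown covariance type: {covariance}")
    return D + cov_params               # D parameters for the mean

for D in (2, 100, 10_000):
    print(D, [gaussian_parameter_count(D, kind)
              for kind in ("general", "diagonal", "isotropic")])
```

For $D = 100$ this gives 5150, 200, and 101 parameters respectively, which makes the quadratic growth of the general case concrete.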

A further limitation of the Gaussian distribution is that it is intrinsically uni- 
modal (i.e., has a single maximum) and so is unable to provide a good approximation 
to multimodal distributions. Thus, the Gaussian distribution can be both too flexible, 
in the sense of having too many parameters, and too limited in the range of distribu- 
tions that it can adequately represent. We will see later that the introduction of latent 
variables, also called hidden variables or unobserved variables, allows both of these 
problems to be addressed. In particular, a rich family of multimodal distributions is
obtained by introducing discrete latent variables leading to mixtures of Gaussians (see Section 3.2.9).
Similarly, the introduction of continuous latent variables leads to models in which the
number of free parameters can be controlled independently of the dimensionality $D$
of the data space while still allowing the model to capture the dominant correlations
in the data set (see Chapter 16).


3.2.4 Conditional distribution 


An important property of a multivariate Gaussian distribution is that if two sets 
of variables are jointly Gaussian, then the conditional distribution of one set condi- 
tioned on the other is again Gaussian. Similarly, the marginal distribution of either 
set is also Gaussian. 

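First, consider the case of conditional distributions. Suppose that $\mathbf{x}$ is a $D$-dimensional
vector with Gaussian distribution $\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$ and that we partition $\mathbf{x}$
into two disjoint subsets $\mathbf{x}_a$ and $\mathbf{x}_b$. Without loss of generality, we can take $\mathbf{x}_a$
to form the first $M$ components of $\mathbf{x}$, with $\mathbf{x}_b$ comprising the remaining $D - M$
components, so that

$$ \mathbf{x} = \begin{pmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{pmatrix}. \tag{3.49} $$

We also define corresponding partitions of the mean vector $\boldsymbol{\mu}$ given by

$$ \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix} \tag{3.50} $$

and of the covariance matrix $\boldsymbol{\Sigma}$ given by

$$ \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix}. \tag{3.51} $$

Note that the symmetry $\boldsymbol{\Sigma}^{\mathrm{T}} = \boldsymbol{\Sigma}$ of the covariance matrix implies that $\boldsymbol{\Sigma}_{aa}$ and $\boldsymbol{\Sigma}_{bb}$
are symmetric and that $\boldsymbol{\Sigma}_{ba} = \boldsymbol{\Sigma}_{ab}^{\mathrm{T}}$.

In many situations, it will be convenient to work with the inverse of the covariance
matrix:

$$ \boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}, \tag{3.52} $$

which is known as the precision matrix. In fact, we will see that some properties
of Gaussian distributions are most naturally expressed in terms of the covariance,
whereas others take a simpler form when viewed in terms of the precision. We
therefore also introduce the partitioned form of the precision matrix:

$$ \boldsymbol{\Lambda} = \begin{pmatrix} \boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb} \end{pmatrix} \tag{3.53} $$

corresponding to the partitioning (3.49) of the vector $\mathbf{x}$. Because the inverse of a
symmetric matrix is also symmetric, we see that $\boldsymbol{\Lambda}_{aa}$ and $\boldsymbol{\Lambda}_{bb}$ are symmetric and
that $\boldsymbol{\Lambda}_{ba} = \boldsymbol{\Lambda}_{ab}^{\mathrm{T}}$. It should be stressed at this point that, for instance, $\boldsymbol{\Lambda}_{aa}$ is not
simply given by the inverse of $\boldsymbol{\Sigma}_{aa}$. In fact, we will shortly examine the relation
between the inverse of a partitioned matrix and the inverses of its partitions.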

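We begin by finding an expression for the conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$.
From the product rule of probability, we see that this conditional distribution can be
evaluated from the joint distribution $p(\mathbf{x}) = p(\mathbf{x}_a, \mathbf{x}_b)$ simply by fixing $\mathbf{x}_b$ to the
observed value and normalizing the resulting expression to obtain a valid probability
distribution over $\mathbf{x}_a$. Instead of performing this normalization explicitly, we can
obtain the solution more efficiently by considering the quadratic form in the exponent
of the Gaussian distribution given by (3.27) and then reinstating the normalization
coefficient at the end of the calculation. If we make use of the partitioning (3.49),
(3.50), and (3.53), we obtain

$$ \begin{aligned} -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = & -\frac{1}{2} (\mathbf{x}_a - \boldsymbol{\mu}_a)^{\mathrm{T}} \boldsymbol{\Lambda}_{aa} (\mathbf{x}_a - \boldsymbol{\mu}_a) - \frac{1}{2} (\mathbf{x}_a - \boldsymbol{\mu}_a)^{\mathrm{T}} \boldsymbol{\Lambda}_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b) \\ & -\frac{1}{2} (\mathbf{x}_b - \boldsymbol{\mu}_b)^{\mathrm{T}} \boldsymbol{\Lambda}_{ba} (\mathbf{x}_a - \boldsymbol{\mu}_a) - \frac{1}{2} (\mathbf{x}_b - \boldsymbol{\mu}_b)^{\mathrm{T}} \boldsymbol{\Lambda}_{bb} (\mathbf{x}_b - \boldsymbol{\mu}_b). \end{aligned} \tag{3.54} $$

We see that as a function of $\mathbf{x}_a$, this is again a quadratic form, and hence, the corresponding
conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$ will be Gaussian. Because this distribution
is completely characterized by its mean and its covariance, our goal will be
to identify expressions for the mean and covariance of $p(\mathbf{x}_a | \mathbf{x}_b)$ by inspection of
(3.54).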

This is an example of a rather common operation associated with Gaussian
distributions, sometimes called ‘completing the square’, in which we are given a
quadratic form defining the exponent terms in a Gaussian distribution and we need
to determine the corresponding mean and covariance. Such problems can be solved
straightforwardly by noting that the exponent in a general Gaussian distribution
$\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$ can be written as

$$ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = -\frac{1}{2} \mathbf{x}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \mathbf{x} + \mathbf{x}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \text{const} \tag{3.55} $$

where ‘const’ denotes terms that are independent of $\mathbf{x}$. We have also made use of
the symmetry of $\boldsymbol{\Sigma}$. Thus, if we take our general quadratic form and express it in
the form given by the right-hand side of (3.55), then we can immediately equate the
matrix of coefficients entering the second-order term in $\mathbf{x}$ to the inverse covariance
matrix $\boldsymbol{\Sigma}^{-1}$ and the coefficient of the linear term in $\mathbf{x}$ to $\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}$, from which we can
obtain $\boldsymbol{\mu}$.

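Now let us apply this procedure to the conditional Gaussian distribution $p(\mathbf{x}_a | \mathbf{x}_b)$
for which the quadratic form in the exponent is given by (3.54). We will denote the
mean and covariance of this distribution by $\boldsymbol{\mu}_{a|b}$ and $\boldsymbol{\Sigma}_{a|b}$, respectively. Consider
the functional dependence of (3.54) on $\mathbf{x}_a$ in which $\mathbf{x}_b$ is regarded as a constant. If
we pick out all terms that are second order in $\mathbf{x}_a$, we have

$$ -\frac{1}{2} \mathbf{x}_a^{\mathrm{T}} \boldsymbol{\Lambda}_{aa} \mathbf{x}_a \tag{3.56} $$

from which we can immediately conclude that the covariance (inverse precision) of
$p(\mathbf{x}_a | \mathbf{x}_b)$ is given by

$$ \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1}. \tag{3.57} $$

Now consider all the terms in (3.54) that are linear in $\mathbf{x}_a$:

$$ \mathbf{x}_a^{\mathrm{T}} \left\{ \boldsymbol{\Lambda}_{aa} \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b) \right\} \tag{3.58} $$

where we have used $\boldsymbol{\Lambda}_{ba}^{\mathrm{T}} = \boldsymbol{\Lambda}_{ab}$. From our discussion of the general form (3.55),
the coefficient of $\mathbf{x}_a$ in this expression must equal $\boldsymbol{\Sigma}_{a|b}^{-1} \boldsymbol{\mu}_{a|b}$ and, hence,

$$ \begin{aligned} \boldsymbol{\mu}_{a|b} &= \boldsymbol{\Sigma}_{a|b} \left\{ \boldsymbol{\Lambda}_{aa} \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b) \right\} \\ &= \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1} \boldsymbol{\Lambda}_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b) \end{aligned} \tag{3.59} $$

where we have made use of (3.57).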

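The results (3.57) and (3.59) are expressed in terms of the partitioned precision
matrix of the original joint distribution $p(\mathbf{x}_a, \mathbf{x}_b)$. We can also express these results
in terms of the corresponding partitioned covariance matrix. To do this, we make use
of the following identity for the inverse of a partitioned matrix:

$$ \begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{M} & -\mathbf{M}\mathbf{B}\mathbf{D}^{-1} \\ -\mathbf{D}^{-1}\mathbf{C}\mathbf{M} & \mathbf{D}^{-1} + \mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} \end{pmatrix} \tag{3.60} $$

where we have defined

$$ \mathbf{M} = (\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}. \tag{3.61} $$

The quantity $\mathbf{M}^{-1}$ is known as the Schur complement of the matrix on the left-hand
side of (3.60) with respect to the submatrix $\mathbf{D}$. Using the definition

$$ \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb} \end{pmatrix} \tag{3.62} $$

and making use of (3.60), we have

$$ \boldsymbol{\Lambda}_{aa} = (\boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba})^{-1} \tag{3.63} $$
$$ \boldsymbol{\Lambda}_{ab} = -(\boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba})^{-1} \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1}. \tag{3.64} $$

From these we obtain the following expressions for the mean and covariance of the
conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$:

$$ \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} (\mathbf{x}_b - \boldsymbol{\mu}_b) \tag{3.65} $$
$$ \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}. \tag{3.66} $$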


Comparing (3.57) and (3.66), we see that the conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$ takes
a simpler form when expressed in terms of the partitioned precision matrix than
when it is expressed in terms of the partitioned covariance matrix. Note that the
mean of the conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$, given by (3.65), is a linear function of
$\mathbf{x}_b$ and that the covariance, given by (3.66), is independent of $\mathbf{x}_b$. This represents an
example of a linear-Gaussian model.
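
The two routes to the conditional distribution can be compared numerically. The sketch below assumes NumPy. The three-dimensional mean, covariance, and observed value are made up, and both function names are hypothetical. One function uses the covariance form (3.65) and (3.66), the other the precision form (3.57) and (3.59), and the two should agree to rounding error.

```python
import numpy as np

def conditional_from_covariance(mu, Sigma, M, x_b):
    """Mean and covariance of p(x_a | x_b) via (3.65) and (3.66).

    x_a is taken to be the first M components of x.
    """
    mu_a, mu_b = mu[:M], mu[M:]
    S_aa, S_ab = Sigma[:M, :M], Sigma[:M, M:]
    S_ba, S_bb = Sigma[M:, :M], Sigma[M:, M:]
    gain = S_ab @ np.linalg.inv(S_bb)
    return mu_a + gain @ (x_b - mu_b), S_aa - gain @ S_ba

def conditional_from_precision(mu, Sigma, M, x_b):
    """The same quantities via the precision matrix, (3.57) and (3.59)."""
    Lam = np.linalg.inv(Sigma)
    L_aa, L_ab = Lam[:M, :M], Lam[:M, M:]
    cov = np.linalg.inv(L_aa)
    return mu[:M] - cov @ L_ab @ (x_b - mu[M:]), cov

# Hypothetical joint Gaussian over three variables; x_b is the last component.
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
x_b = np.array([1.5])

mean_cov, cov_cov = conditional_from_covariance(mu, Sigma, 2, x_b)
mean_prec, cov_prec = conditional_from_precision(mu, Sigma, 2, x_b)
print(mean_cov, mean_prec)
```

Note that the conditional covariance does not depend on the value chosen for `x_b`, in line with the remark above.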


3.2.5 Marginal distribution 


We have seen that if a joint distribution $p(\mathbf{x}_a, \mathbf{x}_b)$ is Gaussian, then the conditional
distribution $p(\mathbf{x}_a | \mathbf{x}_b)$ will again be Gaussian. Now we turn to a discussion of
the marginal distribution given by

$$ p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b) \, \mathrm{d}\mathbf{x}_b, \tag{3.67} $$

which, as we will see, is also Gaussian. Once again, our strategy for calculating this
distribution will be to focus on the quadratic form in the exponent of the joint distribution
and thereby to identify the mean and covariance of the marginal distribution
$p(\mathbf{x}_a)$.

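The quadratic form for the joint distribution can be expressed, using the partitioned
precision matrix, in the form (3.54). Our goal is to integrate out $\mathbf{x}_b$, which is
most easily achieved by first considering the terms involving $\mathbf{x}_b$ and then completing
the square to facilitate the integration. Picking out just those terms that involve $\mathbf{x}_b$,
we have

$$ -\frac{1}{2} \mathbf{x}_b^{\mathrm{T}} \boldsymbol{\Lambda}_{bb} \mathbf{x}_b + \mathbf{x}_b^{\mathrm{T}} \mathbf{m} = -\frac{1}{2} (\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1} \mathbf{m})^{\mathrm{T}} \boldsymbol{\Lambda}_{bb} (\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1} \mathbf{m}) + \frac{1}{2} \mathbf{m}^{\mathrm{T}} \boldsymbol{\Lambda}_{bb}^{-1} \mathbf{m} \tag{3.68} $$

where we have defined

$$ \mathbf{m} = \boldsymbol{\Lambda}_{bb} \boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba} (\mathbf{x}_a - \boldsymbol{\mu}_a). \tag{3.69} $$

We see that the dependence on $\mathbf{x}_b$ has been cast into the standard quadratic form of a
Gaussian distribution corresponding to the first term on the right-hand side of (3.68)
plus a term that does not depend on $\mathbf{x}_b$ (but that does depend on $\mathbf{x}_a$). Thus, when
we take the exponential of this quadratic form, we see that the integration over $\mathbf{x}_b$
required by (3.67) will take the form

$$ \int \exp\left\{ -\frac{1}{2} (\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1} \mathbf{m})^{\mathrm{T}} \boldsymbol{\Lambda}_{bb} (\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1} \mathbf{m}) \right\} \mathrm{d}\mathbf{x}_b. \tag{3.70} $$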


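This integration is easily performed by noting that it is the integral over an unnormalized
Gaussian, and so the result will be the reciprocal of the normalization coefficient.
We know from the form of the normalized Gaussian given by (3.26) that this
coefficient is independent of the mean and depends only on the determinant of the
covariance matrix. Thus, by completing the square with respect to $\mathbf{x}_b$, we can integrate
out $\mathbf{x}_b$ so that the only term remaining from the contributions on the left-hand
side of (3.68) that depends on $\mathbf{x}_a$ is the last term on the right-hand side of (3.68) in
which $\mathbf{m}$ is given by (3.69). Combining this term with the remaining terms from
(3.54) that depend on $\mathbf{x}_a$, we obtain

$$ \begin{aligned} & \frac{1}{2} \left[ \boldsymbol{\Lambda}_{bb} \boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba} (\mathbf{x}_a - \boldsymbol{\mu}_a) \right]^{\mathrm{T}} \boldsymbol{\Lambda}_{bb}^{-1} \left[ \boldsymbol{\Lambda}_{bb} \boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba} (\mathbf{x}_a - \boldsymbol{\mu}_a) \right] \\ &\qquad - \frac{1}{2} \mathbf{x}_a^{\mathrm{T}} \boldsymbol{\Lambda}_{aa} \mathbf{x}_a + \mathbf{x}_a^{\mathrm{T}} (\boldsymbol{\Lambda}_{aa} \boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab} \boldsymbol{\mu}_b) + \text{const} \\ &\quad = -\frac{1}{2} \mathbf{x}_a^{\mathrm{T}} (\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab} \boldsymbol{\Lambda}_{bb}^{-1} \boldsymbol{\Lambda}_{ba}) \mathbf{x}_a \\ &\qquad + \mathbf{x}_a^{\mathrm{T}} (\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab} \boldsymbol{\Lambda}_{bb}^{-1} \boldsymbol{\Lambda}_{ba}) \boldsymbol{\mu}_a + \text{const} \end{aligned} \tag{3.71} $$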


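where ‘const’ denotes quantities independent of $\mathbf{x}_a$. Again, by comparison with
(3.55), we see that the covariance of the marginal distribution $p(\mathbf{x}_a)$ is given by

$$ \boldsymbol{\Sigma}_a = (\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab} \boldsymbol{\Lambda}_{bb}^{-1} \boldsymbol{\Lambda}_{ba})^{-1}. \tag{3.72} $$

Similarly, the mean is given by

$$ \boldsymbol{\Sigma}_a (\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab} \boldsymbol{\Lambda}_{bb}^{-1} \boldsymbol{\Lambda}_{ba}) \boldsymbol{\mu}_a = \boldsymbol{\mu}_a \tag{3.73} $$

where we have used (3.72). The covariance (3.72) is expressed in terms of the partitioned
precision matrix given by (3.53). We can rewrite this in terms of the corresponding
partitioning of the covariance matrix given by (3.51), as we did for the
conditional distribution. These partitioned matrices are related by

$$ \begin{pmatrix} \boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix}. \tag{3.74} $$

Making use of (3.60), we then have

$$ (\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab} \boldsymbol{\Lambda}_{bb}^{-1} \boldsymbol{\Lambda}_{ba})^{-1} = \boldsymbol{\Sigma}_{aa}. \tag{3.75} $$

Thus, we obtain the intuitively satisfying result that the marginal distribution $p(\mathbf{x}_a)$
has mean and covariance given by

$$ \mathbb{E}[\mathbf{x}_a] = \boldsymbol{\mu}_a \tag{3.76} $$
$$ \mathrm{cov}[\mathbf{x}_a] = \boldsymbol{\Sigma}_{aa}. \tag{3.77} $$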


We see that for a marginal distribution, the mean and covariance are most simply ex- 
pressed in terms of the partitioned covariance matrix, in contrast to the conditional 
distribution for which the partitioned precision matrix gives rise to simpler expres- 
sions. 
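
The identity (3.75) is straightforward to confirm numerically. The following sketch assumes NumPy and reuses a made-up three-dimensional covariance with $M = 2$, with illustrative variable names. It forms the precision matrix, evaluates the marginal covariance from (3.72), and compares it with the top-left block of the covariance matrix.

```python
import numpy as np

# Hypothetical joint covariance; x_a is the first M = 2 components.
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
M = 2

Lam = np.linalg.inv(Sigma)                 # precision matrix (3.52)
L_aa, L_ab = Lam[:M, :M], Lam[:M, M:]
L_ba, L_bb = Lam[M:, :M], Lam[M:, M:]

# Marginal covariance computed from the precision blocks, as in (3.72).
Sigma_a = np.linalg.inv(L_aa - L_ab @ np.linalg.inv(L_bb) @ L_ba)

# According to (3.75) this should simply be the block Sigma_aa.
print(Sigma_a)
print(Sigma[:M, :M])
```

In practice, then, marginalizing a Gaussian specified by its covariance amounts to selecting the relevant sub-blocks of the mean and covariance.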

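Our results for the marginal and conditional distributions of a partitioned Gaussian
can be summarized as follows. Given a joint Gaussian distribution $\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$
with $\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}$ and the following partitions

$$ \mathbf{x} = \begin{pmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix} \tag{3.78} $$

$$ \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix}, \quad \boldsymbol{\Lambda} = \begin{pmatrix} \boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb} \end{pmatrix} \tag{3.79} $$

then the conditional distribution is given by

$$ p(\mathbf{x}_a | \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a | \boldsymbol{\mu}_{a|b}, \boldsymbol{\Lambda}_{aa}^{-1}) \tag{3.80} $$
$$ \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1} \boldsymbol{\Lambda}_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b) \tag{3.81} $$

and the marginal distribution is given by

$$ p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a | \boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa}). \tag{3.82} $$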


We illustrate the idea of conditional and marginal distributions associated with 
a multivariate Gaussian using an example involving two variables in Figure 3.5. 


3.2.6 Bayes’ theorem 


In Sections 3.2.4 and 3.2.5 we considered a Gaussian $p(\mathbf{x})$ in which we partitioned
the vector $\mathbf{x}$ into two subvectors $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$ and then found expressions
for the conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$ and the marginal distribution $p(\mathbf{x}_a)$. We
noted that the mean of the conditional distribution $p(\mathbf{x}_a | \mathbf{x}_b)$ was a linear function of
$\mathbf{x}_b$. Here we will suppose that we are given a Gaussian marginal distribution $p(\mathbf{x})$
and a Gaussian conditional distribution p(y|x) in which p(y|x) has a mean that is a 
linear function of x and a covariance that is independent of x. This is an example 
of a linear-Gaussian model (Roweis and Ghahramani, 1999). We wish to find the 
marginal distribution p(y) and the conditional distribution p(x|y). This is a struc- 
ture that arises in several types of generative model and it will prove convenient to 
derive the general results here. 

We will take the marginal and conditional distributions to be 


$$ p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}) \tag{3.83} $$
$$ p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1}) \tag{3.84} $$

Figure 3.5 (a) Contours of a Gaussian distribution $p(x_a, x_b)$ over two variables. (b) The marginal distribution
$p(x_a)$ (blue curve) and the conditional distribution $p(x_a|x_b)$ for $x_b = 0.7$ (red curve).


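where $\boldsymbol{\mu}$, $\mathbf{A}$, and $\mathbf{b}$ are parameters governing the means, and $\boldsymbol{\Lambda}$ and $\mathbf{L}$ are precision
matrices. If $\mathbf{x}$ has dimensionality $M$ and $\mathbf{y}$ has dimensionality $D$, then the matrix $\mathbf{A}$
has size $D \times M$.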

First we find an expression for the joint distribution over $\mathbf{x}$ and $\mathbf{y}$. To do this, we
define

$$ \mathbf{z} = \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \tag{3.85} $$


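and then consider the log of the joint distribution:

$$
\begin{aligned}
\ln p(\mathbf{z}) &= \ln p(\mathbf{x}) + \ln p(\mathbf{y}|\mathbf{x}) \\
&= -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x} - \boldsymbol{\mu})
 - \frac{1}{2}(\mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{b})^{\mathrm{T}}\mathbf{L}(\mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{b}) + \text{const}
\end{aligned}
\tag{3.86}
$$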


where ‘const’ denotes terms independent of x and y. As before, we see that this is a 
quadratic function of the components of z, and hence, p(z) is a Gaussian distribution.
To find the precision of this Gaussian, we consider the second-order terms in (3.86), 
which can be written as 


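$$
\begin{aligned}
&-\frac{1}{2}\mathbf{x}^{\mathrm{T}}(\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})\mathbf{x}
 - \frac{1}{2}\mathbf{y}^{\mathrm{T}}\mathbf{L}\mathbf{y}
 + \frac{1}{2}\mathbf{y}^{\mathrm{T}}\mathbf{L}\mathbf{A}\mathbf{x}
 + \frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{y} \\
&\quad = -\frac{1}{2}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^{\mathrm{T}}
 \begin{pmatrix} \boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A} & -\mathbf{A}^{\mathrm{T}}\mathbf{L} \\ -\mathbf{L}\mathbf{A} & \mathbf{L} \end{pmatrix}
 \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}
 = -\frac{1}{2}\mathbf{z}^{\mathrm{T}}\mathbf{R}\mathbf{z}
\end{aligned}
\tag{3.87}
$$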


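and so the Gaussian distribution over $\mathbf{z}$ has precision (inverse covariance) matrix
given by

$$ \mathbf{R} = \begin{pmatrix} \boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A} & -\mathbf{A}^{\mathrm{T}}\mathbf{L} \\ -\mathbf{L}\mathbf{A} & \mathbf{L} \end{pmatrix}. \tag{3.88} $$

The covariance matrix is found by taking the inverse of the precision, which can be
done using the matrix inversion formula (3.60) to give

$$ \mathrm{cov}[\mathbf{z}] = \mathbf{R}^{-1} = \begin{pmatrix} \boldsymbol{\Lambda}^{-1} & \boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}} \\ \mathbf{A}\boldsymbol{\Lambda}^{-1} & \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}} \end{pmatrix}. \tag{3.89} $$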


Similarly, we can find the mean of the Gaussian distribution over z by identify- 
ing the linear terms in (3.86), which are given by 


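$$ \mathbf{x}^{\mathrm{T}}\boldsymbol{\Lambda}\boldsymbol{\mu} - \mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{b} + \mathbf{y}^{\mathrm{T}}\mathbf{L}\mathbf{b}
= \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^{\mathrm{T}} \begin{pmatrix} \boldsymbol{\Lambda}\boldsymbol{\mu} - \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{b} \\ \mathbf{L}\mathbf{b} \end{pmatrix}. \tag{3.90} $$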


Using our earlier result (3.55) obtained by completing the square over the quadratic 
form of a multivariate Gaussian, we find that the mean of z is given by 


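$$ \mathbb{E}[\mathbf{z}] = \mathbf{R}^{-1} \begin{pmatrix} \boldsymbol{\Lambda}\boldsymbol{\mu} - \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{b} \\ \mathbf{L}\mathbf{b} \end{pmatrix}. \tag{3.91} $$

Making use of (3.89), we then obtain

$$ \mathbb{E}[\mathbf{z}] = \begin{pmatrix} \boldsymbol{\mu} \\ \mathbf{A}\boldsymbol{\mu} + \mathbf{b} \end{pmatrix}. \tag{3.92} $$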


Next we find an expression for the marginal distribution p(y) in which we have 
marginalized over x. Recall that the marginal distribution over a subset of the com- 
ponents of a Gaussian random vector takes a particularly simple form when ex- 
pressed in terms of the partitioned covariance matrix. Specifically, its mean and 
covariance are given by (3.76) and (3.77), respectively. Making use of (3.89) and 
(3.92), we see that the mean and covariance of the marginal distribution p(y) are 
given by 


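$$ \mathbb{E}[\mathbf{y}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b} \tag{3.93} $$
$$ \mathrm{cov}[\mathbf{y}] = \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}. \tag{3.94} $$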


A special case of this result is when A = I, in which case the marginal distribution 
reduces to the convolution of two Gaussians, for which we see that the mean of the 
convolution is the sum of the means of the two Gaussians and the covariance of the 
convolution is the sum of their covariances. 

Finally, we seek an expression for the conditional p(x|y). Recall that the results 
for the conditional distribution are most easily expressed in terms of the partitioned 
precision matrix, using (3.57) and (3.59). Applying these results to (3.89) and (3.92), 
we see that the conditional distribution p(x|y) has mean and covariance given by 


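$$ \mathbb{E}[\mathbf{x}|\mathbf{y}] = (\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\left\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\right\} \tag{3.95} $$
$$ \mathrm{cov}[\mathbf{x}|\mathbf{y}] = (\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}. \tag{3.96} $$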


The evaluation of this conditional distribution can be seen as an example of 
Bayes’ theorem, in which we interpret p(x) as a prior distribution over x. If the 
variable y is observed, then the conditional distribution p(x|y) represents the corre- 
sponding posterior distribution over x. Having found the marginal and conditional 
distributions, we have effectively expressed the joint distribution p(z) = p(x)p(y|x) 
in the form p(x|y)p(y). 

These results can be summarized as follows. Given a marginal Gaussian distri- 
bution for x and a conditional Gaussian distribution for y given x in the form 


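$$ p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}) \tag{3.97} $$
$$ p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1}), \tag{3.98} $$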


then the marginal distribution of y and the conditional distribution of x given y are 
given by 


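$$ p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}) \tag{3.99} $$
$$ p(\mathbf{x}|\mathbf{y}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\Sigma}\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma}) \tag{3.100} $$

where

$$ \boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}. \tag{3.101} $$
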
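
To make the summary (3.97)-(3.101) concrete, the following sketch specializes it to scalar $x$ and $y$, so that every matrix becomes a number. The function name `linear_gaussian` and the numerical values are assumptions made purely for illustration.

```python
def linear_gaussian(mu, lam, a, b, ell, y):
    """Scalar case of (3.99)-(3.101).

    Prior:      p(x)   = N(x | mu, 1/lam)
    Likelihood: p(y|x) = N(y | a*x + b, 1/ell)
    """
    marg_mean = a * mu + b                                  # mean of p(y), (3.99)
    marg_var = 1.0 / ell + a * a / lam                      # variance of p(y), (3.99)
    post_var = 1.0 / (lam + a * ell * a)                    # Sigma in (3.101)
    post_mean = post_var * (a * ell * (y - b) + lam * mu)   # mean of p(x|y), (3.100)
    return marg_mean, marg_var, post_mean, post_var

# Hypothetical values: prior precision 4, noise precision 10, observation y = 3.
marg_mean, marg_var, post_mean, post_var = linear_gaussian(
    mu=1.0, lam=4.0, a=2.0, b=0.5, ell=10.0, y=3.0)
print(marg_mean, marg_var, post_mean, post_var)
```

The posterior precision `lam + a * ell * a` is the sum of the prior precision and a data-dependent term, so observing $y$ can only sharpen our belief about $x$ in this model.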
3.2.7 Maximum likelihood


Given a data set $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^{\mathrm{T}}$ in which the observations $\{\mathbf{x}_n\}$ are assumed
to be drawn independently from a multivariate Gaussian distribution, we can
estimate the parameters of the distribution by maximum likelihood. The log likelihood
function is given by


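$$ \ln p(\mathbf{X}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}). \tag{3.102} $$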


By simple rearrangement, we see that the likelihood function depends on the data set 
only through the two quantities 


$$ \sum_{n=1}^{N}\mathbf{x}_n, \qquad \sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^{\mathrm{T}}. \tag{3.103} $$


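These are known as the sufficient statistics for the Gaussian distribution. Using
(A.19), the derivative of the log likelihood with respect to $\boldsymbol{\mu}$ is given by

$$ \frac{\partial}{\partial\boldsymbol{\mu}}\ln p(\mathbf{X}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}) \tag{3.104} $$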


and setting this derivative to zero, we obtain the solution for the maximum likelihood 
estimate of the mean: 


$$ \boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n, \tag{3.105} $$

which is the mean of the observed set of data points. The maximization of (3.102)
with respect to $\boldsymbol{\Sigma}$ is rather more involved. The simplest approach is to ignore the
symmetry constraint and show that the resulting solution is symmetric as required.
Alternative derivations of this result, which impose the symmetry and positive
definiteness constraints explicitly, can be found in Magnus and Neudecker (1999). The
result is as expected and takes the form

$$ \boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}, \tag{3.106} $$

which involves $\boldsymbol{\mu}_{\mathrm{ML}}$ because this is the result of a joint maximization with respect
to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Note that the solution (3.105) for $\boldsymbol{\mu}_{\mathrm{ML}}$ does not depend on $\boldsymbol{\Sigma}_{\mathrm{ML}}$, and so
we can first evaluate $\boldsymbol{\mu}_{\mathrm{ML}}$ and then use this to evaluate $\boldsymbol{\Sigma}_{\mathrm{ML}}$.

If we evaluate the expectations of the maximum likelihood solutions under the 
true distribution, we obtain the following results 


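$$ \mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu} \tag{3.107} $$
$$ \mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\boldsymbol{\Sigma}. \tag{3.108} $$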


We see that the expectation of the maximum likelihood estimate for the mean is equal 
to the true mean. However, the maximum likelihood estimate for the covariance has 
an expectation that is less than the true value, and hence, it is biased. We can correct 
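this bias by defining a different estimator $\widetilde{\boldsymbol{\Sigma}}$ given by

$$ \widetilde{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}. \tag{3.109} $$

Clearly from (3.106) and (3.108), the expectation of $\widetilde{\boldsymbol{\Sigma}}$ is equal to $\boldsymbol{\Sigma}$.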
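
The bias described by (3.108) can be observed empirically. The following is a minimal sketch for a univariate Gaussian, in which we average the two variance estimators over many small synthetic data sets; the function name, the seed, and the choice $N = 3$ are assumptions made for illustration.

```python
import random

def gaussian_ml(xs):
    """Univariate versions of (3.105), (3.106) and (3.109)."""
    n = len(xs)
    mu_ml = sum(xs) / n
    var_ml = sum((x - mu_ml) ** 2 for x in xs) / n   # biased: divides by N
    var_tilde = var_ml * n / (n - 1)                 # corrected: divides by N - 1
    return mu_ml, var_ml, var_tilde

# Average both variance estimates over many data sets of size N = 3
# drawn from a Gaussian with true mean 0 and true variance 1.
random.seed(0)
trials, n = 20000, 3
sum_ml = sum_tilde = 0.0
for _ in range(trials):
    _, var_ml, var_tilde = gaussian_ml([random.gauss(0.0, 1.0) for _ in range(n)])
    sum_ml += var_ml
    sum_tilde += var_tilde
avg_ml, avg_tilde = sum_ml / trials, sum_tilde / trials
print(avg_ml, avg_tilde)  # should be close to (N - 1)/N = 2/3 and to 1
```

The bias matters most for small $N$; as $N$ grows, the factor $(N-1)/N$ approaches one and the two estimators become practically indistinguishable.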


3.2.8 Sequential estimation 


Our discussion of the maximum likelihood solution represents a batch method 
in which the entire training data set is considered at once. An alternative is to use 
sequential methods, which allow data points to be processed one at a time and then 
discarded. These are important for online applications and for large data when the 
batch processing of all data points at once is infeasible. 

Figure 3.6 Plots of the Old Faithful data in which the red curves are contours of constant
probability density. (a) A single Gaussian distribution which has been fitted to the data
using maximum likelihood. Note that this distribution fails to capture the two clumps in
the data and indeed places much of its probability mass in the central region between the
clumps where the data are relatively sparse. (b) The distribution given by a linear
combination of two Gaussians, also fitted by maximum likelihood, which gives a better
representation of the data.

Consider the result (3.105) for the maximum likelihood estimator of the mean
$\boldsymbol{\mu}_{\mathrm{ML}}$, which we will denote by $\boldsymbol{\mu}_{\mathrm{ML}}^{(N)}$ when it is based on $N$ observations. If we
dissect out the contribution from the final data point $\mathbf{x}_N$, we obtain

$$
\begin{aligned}
\boldsymbol{\mu}_{\mathrm{ML}}^{(N)} &= \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n \\
&= \frac{1}{N}\mathbf{x}_N + \frac{1}{N}\sum_{n=1}^{N-1}\mathbf{x}_n \\
&= \frac{1}{N}\mathbf{x}_N + \frac{N-1}{N}\boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)} \\
&= \boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(\mathbf{x}_N - \boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)}\right).
\end{aligned}
\tag{3.110}
$$


This result has a nice interpretation, as follows. After observing $N - 1$ data points,
we estimate $\boldsymbol{\mu}$ by $\boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)}$. We now observe data point $\mathbf{x}_N$, and we obtain our revised
estimate $\boldsymbol{\mu}_{\mathrm{ML}}^{(N)}$ by moving the old estimate a small amount, proportional to $1/N$, in
the direction of the ‘error signal’ $(\mathbf{x}_N - \boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)})$. Note that, as $N$ increases, so the
contributions from successive data points get smaller.
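
The update (3.110) translates directly into a one-pass procedure that never stores the data. The sketch below is for scalar observations; the function name and the example values are assumptions made for illustration.

```python
def sequential_mean(stream):
    """Running maximum likelihood mean using the update (3.110)."""
    mu = 0.0
    for n, x in enumerate(stream, start=1):
        mu += (x - mu) / n   # move the old estimate by 1/n of the error signal
    return mu

data = [2.0, 4.0, 9.0, 1.0, 5.5]            # hypothetical observations
batch = sum(data) / len(data)               # batch result (3.105)
online = sequential_mean(iter(data))        # consumes one point at a time
print(batch, online)
```

Because each data point is used once and then discarded, the same loop applies unchanged to a stream whose length is not known in advance.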


3.2.9 Mixtures of Gaussians 


Although the Gaussian distribution has some important analytical properties, it
suffers from significant limitations when used to model real data sets.
Consider the example shown in Figure 3.6(a). This is known as the ‘Old Faithful’ 
data set, and comprises 272 measurements of the eruption of the Old Faithful geyser 
in Yellowstone National Park in the USA. Each measurement gives the duration of 
the eruption in minutes (horizontal axis) and the time in minutes to the next eruption 
(vertical axis). We see that the data set forms two dominant clumps, and that a simple 
Gaussian distribution is unable to capture this structure. 

Figure 3.7 Example of a Gaussian mixture distribution in one dimension showing three
Gaussians (each scaled by a coefficient) in blue and their sum in red.

We might expect that a superposition of two Gaussian distributions would be
able to do a much better job of representing the structure in this data set, and indeed
this proves to be the case, as can be seen from Figure 3.6(b). Such superpositions,
formed by taking linear combinations of more basic distributions such as Gaussians, 
can be formulated as probabilistic models known as mixture distributions. In this sec- 
tion we will consider Gaussians to illustrate the framework of mixture models. More 
generally, mixture models can comprise linear combinations of other distributions, 
for example mixtures of Bernoulli distributions for binary variables. In Figure 3.7 we 
see that a linear combination of Gaussians can give rise to very complex densities. 
By using a sufficient number of Gaussians and by adjusting their means and covari- 
ances as well as the coefficients in the linear combination, almost any continuous 
distribution can be approximated to arbitrary accuracy. 
We therefore consider a superposition of K Gaussian densities of the form 


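$$ p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \tag{3.111} $$

which is called a mixture of Gaussians. Each Gaussian density $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is
called a component of the mixture and has its own mean $\boldsymbol{\mu}_k$ and covariance $\boldsymbol{\Sigma}_k$.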
Contour and surface plots for a Gaussian mixture in two dimensions having three 
components are shown in Figure 3.8. 

The parameters $\pi_k$ in (3.111) are called mixing coefficients. If we integrate both
sides of (3.111) with respect to $\mathbf{x}$, and note that both $p(\mathbf{x})$ and the individual Gaussian
components are normalized, we obtain

$$ \sum_{k=1}^{K}\pi_k = 1. \tag{3.112} $$

Also, given that $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \geqslant 0$, a sufficient condition for the requirement
$p(\mathbf{x}) \geqslant 0$ is that $\pi_k \geqslant 0$ for all $k$. Combining this with the condition (3.112), we obtain

$$ 0 \leqslant \pi_k \leqslant 1. \tag{3.113} $$


We can therefore see that the mixing coefficients satisfy the requirements to be prob- 
abilities, and we will show that this probabilistic interpretation of mixture distribu- 
tions is very powerful. 
Figure 3.8 Illustration of a mixture of three Gaussians in a two-dimensional space. (a) Contours of constant
density for each of the mixture components, in which the three components are denoted red, blue, and green, and
the values of the mixing coefficients are shown below each component. (b) Contours of the marginal probability
density p(x) of the mixture distribution. (c) A surface plot of the distribution p(x).


From the sum and product rules of probability, the marginal density can be writ- 
ten as 


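$$ p(\mathbf{x}) = \sum_{k=1}^{K} p(k)\,p(\mathbf{x}|k), \tag{3.114} $$

which is equivalent to (3.111) in which we can view $\pi_k = p(k)$ as the prior probability
of picking the $k$th component, and the density $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = p(\mathbf{x}|k)$ as the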
probability of x conditioned on k. As we will see in later chapters, an important role 
is played by the corresponding posterior probabilities p(k|x), which are also known 
as responsibilities. From Bayes’ theorem, these are given by 


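$$
\begin{aligned}
\gamma_k(\mathbf{x}) \equiv p(k|\mathbf{x}) &= \frac{p(k)\,p(\mathbf{x}|k)}{\sum_{l} p(l)\,p(\mathbf{x}|l)} \\
&= \frac{\pi_k\,\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{l}\pi_l\,\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}.
\end{aligned}
\tag{3.115}
$$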


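The form of the Gaussian mixture distribution is governed by the parameters $\boldsymbol{\pi}$,
$\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$, where we have used the notation $\boldsymbol{\pi} \equiv \{\pi_1, \ldots, \pi_K\}$, $\boldsymbol{\mu} \equiv \{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K\}$,
and $\boldsymbol{\Sigma} \equiv \{\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$. One way to set the values of these parameters is to use
maximum likelihood. From (3.111), the log of the likelihood function is given by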


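$$ \ln p(\mathbf{X}|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\} \tag{3.116} $$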


where $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. We immediately see that the situation is now much more
complex than with a single Gaussian, due to the summation over $k$ inside the logarithm.
As a result, the maximum likelihood solution for the parameters no longer
has a closed-form analytical solution. One approach for maximizing the likelihood 
function is to use iterative numerical optimization techniques. Alternatively, we can 
employ a powerful framework called expectation maximization, which has wide ap- 
plicability to a variety of different deep generative models. 
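
Before turning to optimization, it can help to see the quantities (3.111), (3.115), and (3.116) evaluated directly. The following sketch does this for a one-dimensional mixture; the function names and all parameter values are assumptions chosen for illustration, and no attempt is made to fit the parameters.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, pis, mus, variances):
    """Return the posteriors p(k|x) from (3.115) and the density p(x) from (3.111)."""
    weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
    total = sum(weighted)
    return [w / total for w in weighted], total

def log_likelihood(xs, pis, mus, variances):
    """Log likelihood (3.116): a sum over data of the log of a sum over components."""
    return sum(math.log(responsibilities(x, pis, mus, variances)[1]) for x in xs)

# Hypothetical three-component mixture.
pis, mus, variances = [0.5, 0.3, 0.2], [-2.0, 0.0, 3.0], [0.5, 1.0, 2.0]
gamma, density = responsibilities(-2.0, pis, mus, variances)
print(gamma, density)
print(log_likelihood([-2.1, 0.3, 2.8], pis, mus, variances))
```

This direct evaluation can underflow for points far from every component, in which case the division in `responsibilities` would fail; a practical implementation would typically work with logarithms throughout.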


Periodic Variables 


Although Gaussian distributions are of great practical significance, both in their own 
right and as building blocks for more complex probabilistic models, there are situa- 
tions in which they are inappropriate as density models for continuous variables. One 
important case, which arises in practical applications, is that of periodic variables. 

An example of a periodic variable is the wind direction at a particular geographi- 
cal location. We might, for instance, measure the wind direction at multiple locations 
and wish to summarize this data using a parametric distribution. Another example 
is calendar time, where we may be interested in modelling quantities that are believed
to be periodic over 24 hours or over an annual cycle. Such quantities can
conveniently be represented using an angular (polar) coordinate $0 \leqslant \theta < 2\pi$.

We might be tempted to treat periodic variables by choosing some direction 
as the origin and then applying a conventional distribution such as the Gaussian. 
Such an approach, however, would give results that were strongly dependent on the 
arbitrary choice of origin. Suppose, for instance, that we have two observations at 
$\theta_1 = 1°$ and $\theta_2 = 359°$, and we model them using a standard univariate Gaussian
distribution. If we place the origin at 0°, then the sample mean of this data set will be 
180° with standard deviation 179°, whereas if we place the origin at 180°, then the 
mean will be 0° and the standard deviation will be 1°. We clearly need to develop a 
special approach for periodic variables. 


3.3.1 Von Mises distribution 


Let us consider the problem of evaluating the mean of a set of observations
$\mathcal{D} = \{\theta_1, \ldots, \theta_N\}$ of a periodic variable $\theta$ where $\theta$ is measured in radians. We have
already seen that the simple average $(\theta_1 + \cdots + \theta_N)/N$ will be strongly coordinate
dependent. To find an invariant measure of the mean, note that the observations
can be viewed as points on the unit circle and can therefore be described instead by
two-dimensional unit vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ where $\|\mathbf{x}_n\| = 1$ for $n = 1, \ldots, N$, as
illustrated in Figure 3.9. We can average the vectors $\{\mathbf{x}_n\}$ instead to give


$$ \bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n \tag{3.117} $$


and then find the corresponding angle $\bar{\theta}$ of this average. Clearly, this definition will
ensure that the location of the mean is independent of the origin of the angular coordinate.
Note that $\bar{\mathbf{x}}$ will typically lie inside the unit circle. The Cartesian coordinates
of the observations are given by $\mathbf{x}_n = (\cos\theta_n, \sin\theta_n)$, and we can write the Cartesian
coordinates of the sample mean in the form $\bar{\mathbf{x}} = (\bar{r}\cos\bar{\theta}, \bar{r}\sin\bar{\theta})$. Substituting
into (3.117) and equating the $x_1$ and $x_2$ components then gives

$$ \bar{x}_1 = \bar{r}\cos\bar{\theta} = \frac{1}{N}\sum_{n=1}^{N}\cos\theta_n, \qquad \bar{x}_2 = \bar{r}\sin\bar{\theta} = \frac{1}{N}\sum_{n=1}^{N}\sin\theta_n. \tag{3.118} $$

Figure 3.9 Illustration of the representation of values $\theta_n$ of a periodic variable as
two-dimensional vectors $\mathbf{x}_n$ living on the unit circle. Also shown is the average $\bar{\mathbf{x}}$ of
those vectors.


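Taking the ratio, and using the identity $\tan\theta = \sin\theta/\cos\theta$, we can solve for $\bar{\theta}$ to
give

$$ \bar{\theta} = \tan^{-1}\left\{\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right\}. \tag{3.119} $$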


Shortly, we will see how this result arises naturally as a maximum likelihood estima- 
tor. 
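
The difference between the naive average and the invariant mean (3.119) is easy to check numerically for the two observations at 1° and 359° discussed earlier. In this sketch we use `math.atan2`, which is our own choice: it evaluates the inverse tangent of the ratio in (3.119) while also selecting the correct quadrant. The function name is an assumption.

```python
import math

def circular_mean(thetas):
    """Angle of the averaged unit vectors, as in (3.117)-(3.119); radians in, radians out."""
    s = sum(math.sin(t) for t in thetas)
    c = sum(math.cos(t) for t in thetas)
    return math.atan2(s, c)   # quadrant-aware inverse tangent of s / c

obs_deg = [1.0, 359.0]
obs = [math.radians(d) for d in obs_deg]
naive_deg = sum(obs_deg) / len(obs_deg)            # depends on where the origin is placed
circ_deg = math.degrees(circular_mean(obs))        # essentially 0 degrees
print(naive_deg, circ_deg)
```

The common factor $1/N$ cancels in the ratio, which is why the sums rather than the averages are passed to `atan2`.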

First, we need to define a periodic generalization of the Gaussian called the 
von Mises distribution. Here we will limit our attention to univariate distributions, 
although analogous periodic distributions can also be found over hyperspheres of 
arbitrary dimension (Mardia and Jupp, 2000). 

By convention, we will consider distributions $p(\theta)$ that have period $2\pi$. Any
probability density $p(\theta)$ defined over $\theta$ must not only be non-negative and integrate
to one, but it must also be periodic. Thus, $p(\theta)$ must satisfy the three conditions:

$$ p(\theta) \geqslant 0 \tag{3.120} $$
$$ \int_0^{2\pi} p(\theta)\,\mathrm{d}\theta = 1 \tag{3.121} $$
$$ p(\theta + 2\pi) = p(\theta). \tag{3.122} $$

From (3.122), it follows that $p(\theta + M2\pi) = p(\theta)$ for any integer $M$.

We can easily obtain a Gaussian-like distribution that satisfies these three properties
as follows. Consider a Gaussian distribution over two variables $\mathbf{x} = (x_1, x_2)$
having mean $\boldsymbol{\mu} = (\mu_1, \mu_2)$ and a covariance matrix $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ where $\mathbf{I}$ is the $2 \times 2$
identity matrix, so that

$$ p(x_1, x_2) = \frac{1}{2\pi\sigma^2}\exp\left\{-\frac{(x_1 - \mu_1)^2 + (x_2 - \mu_2)^2}{2\sigma^2}\right\}. \tag{3.123} $$

Figure 3.10 The von Mises distribution can be derived by considering a two-dimensional
Gaussian of the form (3.123), whose density contours are shown in blue, and conditioning on
the unit circle shown in red.


The contours of constant p(x) are circles, as illustrated in Figure 3.10. 

Now suppose we consider the value of this distribution along a circle of fixed 
radius. Then by construction, this distribution will be periodic, although it will not 
be normalized. We can determine the form of this distribution by transforming from 
Cartesian coordinates $(x_1, x_2)$ to polar coordinates $(r, \theta)$ so that

$$ x_1 = r\cos\theta, \qquad x_2 = r\sin\theta. \tag{3.124} $$

We also map the mean $\boldsymbol{\mu}$ into polar coordinates by writing

$$ \mu_1 = r_0\cos\theta_0, \qquad \mu_2 = r_0\sin\theta_0. \tag{3.125} $$


Next we substitute these transformations into the two-dimensional Gaussian distribu- 
tion (3.123), and then condition on the unit circle r = 1, noting that we are interested 
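only in the dependence on $\theta$. Focusing on the exponent in the Gaussian distribution
we have

$$
\begin{aligned}
&-\frac{1}{2\sigma^2}\left\{(r\cos\theta - r_0\cos\theta_0)^2 + (r\sin\theta - r_0\sin\theta_0)^2\right\} \\
&\quad = -\frac{1}{2\sigma^2}\left\{1 + r_0^2 - 2r_0\cos\theta\cos\theta_0 - 2r_0\sin\theta\sin\theta_0\right\} \\
&\quad = \frac{r_0}{\sigma^2}\cos(\theta - \theta_0) + \text{const}
\end{aligned}
\tag{3.126}
$$
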
where ‘const’ denotes terms independent of $\theta$. We have made use of the following
trigonometrical identities:

$$ \cos^2 A + \sin^2 A = 1 \tag{3.127} $$
$$ \cos A\cos B + \sin A\sin B = \cos(A - B). \tag{3.128} $$


If we now define $m = r_0/\sigma^2$, we obtain our final expression for the distribution of
$p(\theta)$ along the unit circle $r = 1$ in the form

$$ p(\theta|\theta_0, m) = \frac{1}{2\pi I_0(m)}\exp\left\{m\cos(\theta - \theta_0)\right\}, \tag{3.129} $$

which is called the von Mises distribution or the circular normal. Here the parameter
$\theta_0$ corresponds to the mean of the distribution, whereas $m$, which is known
as the concentration parameter, is analogous to the inverse variance (i.e. the precision)
for the Gaussian. The normalization coefficient in (3.129) is expressed in
terms of $I_0(m)$, which is the zeroth-order modified Bessel function of the first kind
(Abramowitz and Stegun, 1965) and is defined by

$$ I_0(m) = \frac{1}{2\pi}\int_0^{2\pi}\exp\left\{m\cos\theta\right\}\mathrm{d}\theta. \tag{3.130} $$

For large $m$, the distribution becomes approximately Gaussian. The von Mises distribution
is plotted in Figure 3.11, and the function $I_0(m)$ is plotted in Figure 3.12.

Figure 3.11 The von Mises distribution plotted for two different parameter values, shown as a Cartesian plot
on the left and as the corresponding polar plot on the right.
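
The density (3.129) can be evaluated with nothing more than the standard library if we compute $I_0(m)$ from its power series $I_0(m) = \sum_{k \geqslant 0} (m^2/4)^k/(k!)^2$, which is a standard expansion of this Bessel function and not part of the derivation above. The following is a sketch under that assumption; the function names, the truncation at 60 terms, and the parameter values are illustrative choices.

```python
import math

def bessel_i0(m, terms=60):
    """Zeroth-order modified Bessel function via its power series (truncated)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (m * m / 4.0) / ((k + 1) ** 2)   # ratio of consecutive series terms
    return total

def von_mises_pdf(theta, theta0, m):
    """Von Mises density (3.129)."""
    return math.exp(m * math.cos(theta - theta0)) / (2.0 * math.pi * bessel_i0(m))

# Check the three conditions (3.120)-(3.122) numerically for one parameter setting.
theta0, m, n = math.pi / 4, 5.0, 2000
step = 2.0 * math.pi / n
integral = sum(von_mises_pdf((i + 0.5) * step, theta0, m) for i in range(n)) * step
print(integral)                                   # should be very close to 1
print(von_mises_pdf(0.3, theta0, m), von_mises_pdf(0.3 + 2.0 * math.pi, theta0, m))
```

The truncated series is adequate for moderate concentrations such as those used here; for very large $m$ a scaled or asymptotic form of $I_0$ would be a safer choice.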

Now consider the maximum likelihood estimators for the parameters $\theta_0$ and $m$
for the von Mises distribution. The log likelihood function is given by

$$ \ln p(\mathcal{D}|\theta_0, m) = -N\ln(2\pi) - N\ln I_0(m) + m\sum_{n=1}^{N}\cos(\theta_n - \theta_0). \tag{3.131} $$

Setting the derivative with respect to $\theta_0$ equal to zero gives

$$ \sum_{n=1}^{N}\sin(\theta_n - \theta_0) = 0. \tag{3.132} $$


To solve for $\theta_0$, we make use of the trigonometric identity

$$ \sin(A - B) = \cos B\sin A - \cos A\sin B \tag{3.133} $$

from which we obtain

$$ \theta_0^{\mathrm{ML}} = \tan^{-1}\left\{\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right\}, \tag{3.134} $$

which we recognize as the result (3.119) obtained earlier for the mean of the observations
viewed in a two-dimensional Cartesian space.

Figure 3.12 Plot of the Bessel function $I_0(m)$ defined by (3.130), together with the function $A(m)$ defined by
(3.136).

Similarly, maximizing (3.131) with respect to $m$ and making use of $I_0'(m) = I_1(m)$
(Abramowitz and Stegun, 1965), we have

$$ A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N}\cos(\theta_n - \theta_0^{\mathrm{ML}}) \tag{3.135} $$

where we have substituted for the maximum likelihood solution for $\theta_0^{\mathrm{ML}}$ (recalling
that we are performing a joint optimization over $\theta$ and $m$), and we have defined

$$ A(m) = \frac{I_1(m)}{I_0(m)}. \tag{3.136} $$

The function $A(m)$ is plotted in Figure 3.12. Making use of the trigonometric identity
(3.128), we can write (3.135) in the form

$$ A(m_{\mathrm{ML}}) = \left(\frac{1}{N}\sum_{n=1}^{N}\cos\theta_n\right)\cos\theta_0^{\mathrm{ML}} + \left(\frac{1}{N}\sum_{n=1}^{N}\sin\theta_n\right)\sin\theta_0^{\mathrm{ML}}. \tag{3.137} $$

The right-hand side of (3.137) is easily evaluated, and the function A(m) can be in- 
verted numerically. One limitation of the von Mises distribution is that it is unimodal. 
By forming mixtures of von Mises distributions, we obtain a flexible framework for 
modelling periodic variables that can handle multimodality. 
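
One simple way to carry out the numerical inversion of $A(m)$ is bisection, since $A(m)$ increases monotonically from 0 towards 1. The sketch below takes this route and is only one of several possible approaches. It computes $I_0$ and $I_1$ from their standard power series, and the function names, the search interval $[0, 50]$, and the data values are assumptions made for illustration.

```python
import math

def bessel_i(order, m, terms=80):
    """Modified Bessel function I_order(m) from its power series (order 0 or 1)."""
    total = 0.0
    term = (m / 2.0) ** order / math.factorial(order)
    for k in range(terms):
        total += term
        term *= (m * m / 4.0) / ((k + 1) * (k + 1 + order))
    return total

def bessel_ratio(m):
    """A(m) = I_1(m) / I_0(m), as in (3.136)."""
    return bessel_i(1, m) / bessel_i(0, m)

def fit_von_mises(thetas, m_max=50.0, iters=60):
    """Maximum likelihood estimates: theta0 from (3.134), m by inverting (3.137)."""
    n = len(thetas)
    c = sum(math.cos(t) for t in thetas) / n
    s = sum(math.sin(t) for t in thetas) / n
    theta0 = math.atan2(s, c)
    target = c * math.cos(theta0) + s * math.sin(theta0)   # right-hand side of (3.137)
    lo, hi = 0.0, m_max
    for _ in range(iters):          # bisection on the increasing function A(m)
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return theta0, 0.5 * (lo + hi)

data = [0.0, 0.5, 1.0, 1.5, 5.8]    # hypothetical angles in radians
theta0_ml, m_ml = fit_von_mises(data)
print(theta0_ml, m_ml)
```

If the data are so concentrated that the target exceeds $A(m)$ at the upper end of the search interval, this sketch simply returns a value close to `m_max`, so the interval would need to be widened in that case.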

For completeness, we mention briefly some alternative techniques for construct- 
ing periodic distributions. The simplest approach is to use a histogram of observa- 
tions in which the angular coordinate is divided into fixed bins. This has the virtue of 
simplicity and flexibility but also suffers from significant limitations, as we will see
when we discuss histogram methods in more detail later. Another approach starts, 
like the von Mises distribution, from a Gaussian distribution over a Euclidean space 
but now marginalizes onto the unit circle rather than conditioning (Mardia and Jupp, 
2000). However, this leads to more complex forms of distribution and will not be 
discussed further. Finally, any valid distribution over the real axis (such as a Gaussian)
can be turned into a periodic distribution by mapping successive intervals of
width $2\pi$ onto the periodic variable $(0, 2\pi)$, which corresponds to ‘wrapping’ the
real axis around the unit circle. Again, the resulting distribution is more complex to 
handle than the von Mises distribution. 


The Exponential Family 


The probability distributions that we have studied so far in this chapter (with the 
exception of mixture models) are specific examples of a broad class of distributions 
called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994). 
Members of the exponential family have many important properties in common, and 
it is illuminating to discuss these properties in some generality. 

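The exponential family of distributions over $\mathbf{x}$, given parameters $\boldsymbol{\eta}$, is defined to
be the set of distributions of the form

$$ p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\} \tag{3.138} $$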


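where $\mathbf{x}$ may be scalar or vector and may be discrete or continuous. Here $\boldsymbol{\eta}$ are called
the natural parameters of the distribution, and $\mathbf{u}(\mathbf{x})$ is some function of $\mathbf{x}$. The
function $g(\boldsymbol{\eta})$ can be interpreted as the coefficient that ensures that the distribution
is normalized, and therefore, it satisfies

$$ g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathrm{d}\mathbf{x} = 1 \tag{3.139} $$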


where the integration is replaced by summation if x is a discrete variable. 

We begin by taking some examples of the distributions introduced earlier in 
the chapter and showing that they are indeed members of the exponential family. 
Consider first the Bernoulli distribution: 


$$ p(x|\mu) = \mathrm{Bern}(x|\mu) = \mu^x(1 - \mu)^{1-x}. \tag{3.140} $$


Expressing the right-hand side as the exponential of the logarithm, we have 


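$$
\begin{aligned}
p(x|\mu) &= \exp\left\{x\ln\mu + (1 - x)\ln(1 - \mu)\right\} \\
&= (1 - \mu)\exp\left\{\ln\left(\frac{\mu}{1 - \mu}\right)x\right\}.
\end{aligned}
\tag{3.141}
$$

Comparison with (3.138) allows us to identify

$$ \eta = \ln\left(\frac{\mu}{1 - \mu}\right) \tag{3.142} $$

which we can solve for $\mu$ to give $\mu = \sigma(\eta)$, where

$$ \sigma(\eta) = \frac{1}{1 + \exp(-\eta)} \tag{3.143} $$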


is called the logistic sigmoid function. Thus, we can write the Bernoulli distribution 
using the standard representation (3.138) in the form 


$$ p(x|\eta) = \sigma(-\eta)\exp(\eta x) \tag{3.144} $$

where we have used $1 - \sigma(\eta) = \sigma(-\eta)$, which is easily proved from (3.143). Comparison
with (3.138) shows that

$$ u(x) = x \tag{3.145} $$
$$ h(x) = 1 \tag{3.146} $$
$$ g(\eta) = \sigma(-\eta). \tag{3.147} $$
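
The equivalence of the standard form (3.140) and the exponential family form (3.144) can be confirmed numerically. The following is a minimal check; the function names and the value $\mu = 0.3$ are assumptions made for illustration.

```python
import math

def sigmoid(eta):
    """Logistic sigmoid (3.143)."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_standard(x, mu):
    """Bernoulli distribution in its usual parameterization (3.140)."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def bernoulli_natural(x, eta):
    """Bernoulli distribution in exponential family form (3.144)."""
    return sigmoid(-eta) * math.exp(eta * x)

mu = 0.3
eta = math.log(mu / (1.0 - mu))       # natural parameter (3.142)
for x in (0, 1):
    print(x, bernoulli_standard(x, mu), bernoulli_natural(x, eta))
```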


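Next consider the multinomial distribution which, for a single observation $\mathbf{x}$,
takes the form

$$ p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{M}\mu_k^{x_k} = \exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\} \tag{3.148} $$

where $\mathbf{x} = (x_1, \ldots, x_M)^{\mathrm{T}}$. Again, we can write this in the standard representation
(3.138) so that

$$ p(\mathbf{x}|\boldsymbol{\eta}) = \exp(\boldsymbol{\eta}^{\mathrm{T}}\mathbf{x}) \tag{3.149} $$

where $\eta_k = \ln\mu_k$, and we have defined $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_M)^{\mathrm{T}}$. Again, comparing with
(3.138) we have

$$ \mathbf{u}(\mathbf{x}) = \mathbf{x} \tag{3.150} $$
$$ h(\mathbf{x}) = 1 \tag{3.151} $$
$$ g(\boldsymbol{\eta}) = 1. \tag{3.152} $$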


Note that the parameters $\eta_k$ are not independent because the parameters $\mu_k$ are subject
to the constraint

$$ \sum_{k=1}^{M}\mu_k = 1 \tag{3.153} $$

so that, given any $M - 1$ of the parameters $\mu_k$, the value of the remaining parameter
is fixed. In some circumstances, it will be convenient to remove this constraint by
expressing the distribution in terms of only $M - 1$ parameters. This can be achieved
by using the relationship (3.153) to eliminate $\mu_M$ by expressing it in terms of the
remaining $\{\mu_k\}$ where $k = 1, \ldots, M - 1$, thereby leaving $M - 1$ parameters. Note
that these remaining parameters are still subject to the constraints

$$ 0 \leqslant \mu_k \leqslant 1, \qquad \sum_{k=1}^{M-1}\mu_k \leqslant 1. \tag{3.154} $$

k=1 
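Making use of the constraint (3.153), the multinomial distribution in this representation
then becomes

$$
\begin{aligned}
\exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\}
&= \exp\left\{\sum_{k=1}^{M-1}x_k\ln\mu_k + \left(1 - \sum_{k=1}^{M-1}x_k\right)\ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\} \\
&= \exp\left\{\sum_{k=1}^{M-1}x_k\ln\left(\frac{\mu_k}{1 - \sum_{j=1}^{M-1}\mu_j}\right) + \ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\}.
\end{aligned}
\tag{3.155}
$$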


We now identify 


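$$ \ln\left(\frac{\mu_k}{1 - \sum_{j}\mu_j}\right) = \eta_k, \tag{3.156} $$

which we can solve for $\mu_k$ by first summing both sides over $k$ and then rearranging
and back-substituting to give

$$ \mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j}\exp(\eta_j)}. \tag{3.157} $$

This is called the softmax function or the normalized exponential. In this representation,
the multinomial distribution therefore takes the form

$$ p(\mathbf{x}|\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1}\exp(\boldsymbol{\eta}^{\mathrm{T}}\mathbf{x}). \tag{3.158} $$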


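This is the standard form of the exponential family, with parameter vector
$\boldsymbol{\eta} = (\eta_1, \ldots, \eta_{M-1})^{\mathrm{T}}$ in which

$$ \mathbf{u}(\mathbf{x}) = \mathbf{x} \tag{3.159} $$
$$ h(\mathbf{x}) = 1 \tag{3.160} $$
$$ g(\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1}. \tag{3.161} $$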
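
The maps (3.156) and (3.157) between the probabilities $\{\mu_k\}$ and the $M - 1$ natural parameters are inverses of one another, which the following sketch checks by a round trip. The function names and the example probabilities are assumptions made for illustration.

```python
import math

def eta_from_mu(mu):
    """Natural parameters (3.156); mu has M entries that sum to one."""
    mu_last = 1.0 - sum(mu[:-1])              # mu_M eliminated via (3.153)
    return [math.log(m / mu_last) for m in mu[:-1]]

def mu_from_eta(eta):
    """Softmax (3.157) for the first M - 1 entries; mu_M follows from (3.153)."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [math.exp(e) / denom for e in eta] + [1.0 / denom]

mu = [0.2, 0.5, 0.3]                          # hypothetical probabilities, M = 3
eta = eta_from_mu(mu)                         # M - 1 = 2 unconstrained parameters
print(eta, mu_from_eta(eta))
```

Any real-valued vector of length $M - 1$ passed to `mu_from_eta` yields valid probabilities, which is what makes this parameterization convenient when the constraint (3.153) would otherwise have to be enforced explicitly.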


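Finally, let us consider the Gaussian distribution. For the univariate Gaussian,
we have

$$
\begin{aligned}
p(x|\mu, \sigma^2) &= \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} \qquad (3.162) \\
&= \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right\}, \qquad (3.163)
\end{aligned}
$$

which, after some simple rearranging (Exercise 3.35), can be cast in the standard
exponential family form (3.138) with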


Exercise 3.36 
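$$ \boldsymbol{\eta} = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix} \tag{3.164} $$
$$ \mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix} \tag{3.165} $$
$$ h(x) = (2\pi)^{-1/2} \tag{3.166} $$
$$ g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2}\exp\left(\frac{\eta_1^2}{4\eta_2}\right). \tag{3.167} $$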
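
As a check on (3.164)-(3.167), the following sketch evaluates the univariate Gaussian both in its usual form and in the exponential family form and compares the two. The function names and the test values are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian in its usual form (3.162)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gaussian_exp_family(x, mu, var):
    """The same density written as h(x) g(eta) exp{eta^T u(x)}, as in (3.138)."""
    eta1, eta2 = mu / var, -0.5 / var                                  # (3.164)
    h = (2.0 * math.pi) ** -0.5                                        # (3.166)
    g = math.sqrt(-2.0 * eta2) * math.exp(eta1 ** 2 / (4.0 * eta2))    # (3.167)
    return h * g * math.exp(eta1 * x + eta2 * x * x)                   # u(x) = (x, x^2)

for x in (-1.0, 0.4, 2.5):
    print(x, gaussian_pdf(x, 0.7, 1.8), gaussian_exp_family(x, 0.7, 1.8))
```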


Finally, we shall sometimes make use of a restricted form of (3.138) in which 
we choose u(x) = x. However, this can be somewhat generalized by noting that if 


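$f(\mathbf{x})$ is a normalized density then

$$ \frac{1}{s}f\left(\frac{1}{s}\mathbf{x}\right) \tag{3.168} $$

is also a normalized density, where $s > 0$ is a scale parameter. Combining these, we
arrive at a restricted set of exponential family class-conditional densities of the form

$$ p(\mathbf{x}|\boldsymbol{\lambda}_k, s) = \frac{1}{s}h\left(\frac{1}{s}\mathbf{x}\right)g(\boldsymbol{\lambda}_k)\exp\left\{\frac{1}{s}\boldsymbol{\lambda}_k^{\mathrm{T}}\mathbf{x}\right\}. \tag{3.169} $$

Note that we are allowing each class to have its own parameter vector $\boldsymbol{\lambda}_k$, but we are
assuming that the classes share the same scale parameter $s$.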


3.4.1 Sufficient statistics 


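Let us now consider the problem of estimating the parameter vector $\boldsymbol{\eta}$ in the general
exponential family distribution (3.138) using the technique of maximum likelihood.
Taking the gradient of both sides of (3.139) with respect to $\boldsymbol{\eta}$, we have

$$
\nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathrm{d}\mathbf{x}
+ g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = 0. \tag{3.170}
$$

Rearranging and making use again of (3.139) then gives

$$ -\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta}) = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})]. \tag{3.171} $$

We therefore obtain the result

$$ -\nabla\ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]. \tag{3.172} $$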


Note that the covariance of u(x) can be expressed in terms of the second derivatives 
of $g(\boldsymbol{\eta})$, and similarly for higher-order moments. Thus, provided we can normalize a
distribution from the exponential family, we can always find its moments by simple 
differentiation. 
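
For the Bernoulli distribution, where $g(\eta) = \sigma(-\eta)$ and $u(x) = x$, the result (3.172) can be checked with finite differences: the first derivative of $-\ln g(\eta)$ should equal the mean $\mu$, and the second derivative should equal the variance $\mu(1 - \mu)$. The sketch below does this; the helper names and the step size are assumptions made for illustration.

```python
import math

def neg_log_g(eta):
    """-ln g(eta) for the Bernoulli case, where g(eta) = sigma(-eta) = 1/(1 + e^eta)."""
    return math.log(1.0 + math.exp(eta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation to the first derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

mu = 0.3
eta = math.log(mu / (1.0 - mu))
mean_from_g = derivative(neg_log_g, eta)          # approximates E[x] = mu
var_from_g = second_derivative(neg_log_g, eta)    # approximates var[x] = mu (1 - mu)
print(mean_from_g, var_from_g)
```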
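Now consider a set of independent identically distributed data denoted by
$\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, for which the likelihood function is given by

$$ p(\mathbf{X}|\boldsymbol{\eta}) = \left(\prod_{n=1}^{N}h(\mathbf{x}_n)\right)g(\boldsymbol{\eta})^N\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)\right\}. \tag{3.173} $$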


Setting the gradient of $\ln p(\mathbf{X}|\boldsymbol{\eta})$ with respect to $\boldsymbol{\eta}$ to zero, we get the following
condition to be satisfied by the maximum likelihood estimator $\boldsymbol{\eta}_{\mathrm{ML}}$:

$$ -\nabla\ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n), \tag{3.174} $$

which can in principle be solved to obtain $\boldsymbol{\eta}_{\mathrm{ML}}$. We see that the solution for the
maximum likelihood estimator depends on the data only through $\sum_n \mathbf{u}(\mathbf{x}_n)$, which
is therefore called the sufficient statistic of the distribution (3.138). We do not need
to store the entire data set itself but only the value of the sufficient statistic. For
the Bernoulli distribution, for example, the function $\mathbf{u}(x)$ is given just by $x$ and
so we need only keep the sum of the data points $\{x_n\}$, whereas for the Gaussian
$\mathbf{u}(x) = (x, x^2)^{\mathrm{T}}$, and so we should keep both the sum of $\{x_n\}$ and the sum of $\{x_n^2\}$.

If we consider the limit $N \to \infty$, then the right-hand side of (3.174) becomes
$\mathbb{E}[\mathbf{u}(\mathbf{x})]$, and so by comparing with (3.172) we see that in this limit, $\boldsymbol{\eta}_{\mathrm{ML}}$ will equal
the true value $\boldsymbol{\eta}$.
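
The following sketch shows maximum likelihood computed from the sufficient statistics alone for the two examples just mentioned. For the Bernoulli case, $-\nabla\ln g(\eta) = \sigma(\eta)$, so (3.174) is solved by the log odds of the sample mean. For the Gaussian, matching $\mathbb{E}[x]$ and $\mathbb{E}[x^2]$ to the sample averages gives the usual mean and variance. The function names and data are assumptions made for illustration.

```python
import math

def bernoulli_eta_ml(sum_x, n):
    """Solve (3.174) for the Bernoulli case: sigma(eta_ML) = (1/N) sum_n x_n."""
    mean_u = sum_x / n
    return math.log(mean_u / (1.0 - mean_u))

def gaussian_ml_from_stats(sum_x, sum_x2, n):
    """Gaussian maximum likelihood mean and variance from sum x_n and sum x_n^2."""
    mean = sum_x / n
    return mean, sum_x2 / n - mean ** 2

coin = [1, 0, 1, 1, 0, 1, 0, 1]                 # hypothetical binary observations
eta_ml = bernoulli_eta_ml(sum(coin), len(coin))

xs = [1.0, 2.0, 4.0]                            # hypothetical real observations
mean_ml, var_ml = gaussian_ml_from_stats(sum(xs), sum(x * x for x in xs), len(xs))
print(eta_ml, mean_ml, var_ml)
```

The Bernoulli solution is undefined when the observations are all zeros or all ones, since the log odds then diverge; the sketch does not guard against this case.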


Nonparametric Methods 


Throughout this chapter, we have focused on the use of probability distributions 
having specific functional forms governed by a small number of parameters whose 
values are to be determined from a data set. This is called the parametric approach 
to density modelling. An important limitation of this approach is that the chosen 
density might be a poor model of the distribution that generates the data, which can 
result in poor predictive performance. For instance, if the process that generates the 
data is multimodal, then this aspect of the distribution can never be captured by a 
Gaussian, which is necessarily unimodal. In this final section, we consider some 
nonparametric approaches to density estimation that make few assumptions about 
the form of the distribution. 


3.5.1 Histograms 


Let us start with a discussion of histogram methods for density estimation, which 
we have already encountered in the context of marginal and conditional distributions 
in Figure 2.5 and in the context of the central limit theorem in Figure 3.2. Here we 
explore the properties of histogram density models in more detail, focusing on cases 
with a single continuous variable $x$. Standard histograms simply partition $x$ into
distinct bins of width $\Delta_i$ and then count the number $n_i$ of observations of $x$ falling
in bin $i$. To turn this count into a normalized probability density, we simply divide
by the total number $N$ of observations and by the width $\Delta_i$ of the bins to obtain
probability values for each bin:

$$ p_i = \frac{n_i}{N\Delta_i} \tag{3.175} $$

for which it is easily seen that $\int p(x)\,\mathrm{d}x = 1$. This gives a model for the density
$p(x)$ that is constant over the width of each bin. Often the bins are chosen to have
the same width $\Delta_i = \Delta$.

Figure 3.13 An illustration of the histogram approach to density estimation, in which a data
set of 50 data points is generated from the distribution shown by the green curve. Histogram
density estimates, based on (3.175) with a common bin width $\Delta$, are shown for various values
of $\Delta$ (panels from top to bottom: $\Delta = 0.04$, $\Delta = 0.08$, $\Delta = 0.25$).

In Figure 3.13, we show an example of histogram density estimation. Here the
data is drawn from the distribution corresponding to the green curve, which is formed
from a mixture of two Gaussians. Also shown are three examples of histogram
density estimates corresponding to three different choices for the bin width $\Delta$. We
see that when $\Delta$ is very small (top figure), the resulting density model is very spiky,
with a lot of structure that is not present in the underlying distribution that generated
the data set. Conversely, if $\Delta$ is too large (bottom figure) then the result is a model
that is too smooth and consequently fails to capture the bimodal property of the
green curve. The best results are obtained for some intermediate value of $\Delta$ (middle
figure). In principle, a histogram density model is also dependent on the choice of
edge location for the bins, though this is typically much less significant than the bin
width $\Delta$.
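
The histogram estimate (3.175) with a common bin width can be sketched in a few lines of Python. The function name `histogram_density`, the interval [0, 1], and the ten data values below are assumptions made purely for illustration; they are not the data set of Figure 3.13.

```python
def histogram_density(data, lo, hi, num_bins):
    """Equal-width histogram density: p_i = n_i / (N * width), as in (3.175)."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        if not lo <= x <= hi:
            raise ValueError(f"data point {x} lies outside [{lo}, {hi}]")
        # Clamp so that x == hi falls in the last bin rather than past it.
        i = min(int((x - lo) / width), num_bins - 1)
        counts[i] += 1
    n_total = len(data)
    densities = [n_i / (n_total * width) for n_i in counts]
    return width, densities


# Toy one-dimensional data on [0, 1], assumed for illustration.
data = [0.05, 0.1, 0.12, 0.3, 0.35, 0.6, 0.62, 0.65, 0.7, 0.9]
width, densities = histogram_density(data, 0.0, 1.0, num_bins=4)
print(width, densities)
# The piecewise-constant density integrates to one.
print(sum(densities) * width)
```

Re-running the sketch with a much larger or much smaller `num_bins` gives a quick feel for the spiky and over-smoothed regimes described above.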

Note that the histogram method has the property (unlike the methods to be dis- 
cussed shortly) that, once the histogram has been computed, the data set itself can 
be discarded, which can be advantageous if the data set is large. Also, the histogram 
approach is easily applied if the data points arrive sequentially. 

In practice, the histogram technique can be useful for obtaining a quick visual- 
ization of data in one or two dimensions but is unsuited to most density estimation 
applications. One obvious problem is that the estimated density has discontinuities 
that are due to the bin edges rather than any property of the underlying distribution 
that generated the data. A major limitation of the histogram approach is its scaling
with dimensionality. If we divide each variable in a D-dimensional space into
M bins, then the total number of bins will be $M^D$. This exponential scaling with
D is an example of the curse of dimensionality. In a space of high dimensionality, 
the quantity of data needed to provide meaningful estimates of the local probability 
density would be prohibitive. 
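
To get a feel for this exponential growth, the short Python sketch below evaluates $M^D$ for a few dimensionalities and the average number of observations per bin for a fixed data set size. The helper names and the particular values of M, D, and N are assumptions chosen only for illustration.

```python
def num_histogram_bins(bins_per_axis, dims):
    """Total number of bins, M^D, when each of D axes is split into M bins."""
    return bins_per_axis ** dims


def mean_points_per_bin(num_points, bins_per_axis, dims):
    """Average number of observations per bin if N points are spread over M^D bins."""
    return num_points / num_histogram_bins(bins_per_axis, dims)


for dims in (1, 2, 3, 10):
    bins = num_histogram_bins(10, dims)
    print(dims, bins, mean_points_per_bin(10**6, 10, dims))
```

With M = 10 and one million observations, the average occupancy drops below one point per bin once D exceeds 6, so most bins would be empty.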

The histogram approach to density estimation does, however, teach us two im- 
portant lessons. First, to estimate the probability density at a particular location, 
we should consider the data points that lie within some local neighbourhood of that 
point. Note that the concept of locality requires that we assume some form of dis- 
tance measure, and here we have been assuming Euclidean distance. For histograms, 
this neighbourhood property was defined by the bins, and there is a natural ‘smooth- 
ing’ parameter describing the spatial extent of the local region, in this case the bin 
width. Second, to obtain good results, the value of the smoothing parameter should 
be neither too large nor too small. This is reminiscent of the choice of model
complexity in polynomial regression where the degree M of the polynomial, or
alternatively the value $\lambda$ of the regularization parameter, was optimal for some intermediate
value, neither too large nor too small. Armed with these insights, we turn now to a
discussion of two widely used nonparametric techniques for density estimation, ker- 
nel estimators and nearest neighbours, which have better scaling with dimensionality 
than the simple histogram model. 


3.5.2 Kernel densities 


Let us suppose that observations are being drawn from some unknown probabil- 
ity density p(x) in some D-dimensional space, which we will take to be Euclidean, 
and we wish to estimate the value of p(x). From our earlier discussion of locality, 
let us consider some small region R containing x. The probability mass associated 
with this region is given by 


$$P = \int_{R} p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{3.176}$$


Now suppose that we have collected a data set comprising N observations drawn 
from p(x). Because each data point has a probability P of falling within R, the total
number K of points that lie inside R will be distributed according to the binomial
distribution:

$$\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^{K} (1-P)^{N-K}. \tag{3.177}$$

Using (3.11), we see that the mean fraction of points falling inside the region is
$\mathbb{E}[K/N] = P$, and similarly using (3.12), we see that the variance around this mean
is $\operatorname{var}[K/N] = P(1-P)/N$. For large N, this distribution will be sharply peaked
around the mean and so

$$K \simeq NP. \tag{3.178}$$


If, however, we also assume that the region R is sufficiently small so that the proba- 
bility density p(x) is roughly constant over the region, then we have 


$$P \simeq p(\mathbf{x})V \tag{3.179}$$

where V is the volume of R. Combining (3.178) and (3.179), we obtain our density
estimate in the form

$$p(\mathbf{x}) = \frac{K}{NV}. \tag{3.180}$$


Note that the validity of (3.180) depends on two contradictory assumptions, namely 
that the region R is sufficiently small that the density is approximately constant over
the region and yet sufficiently large (in relation to the value of that density) that the 
number K of points falling inside the region is sufficient for the binomial distribution 
to be sharply peaked. 

We can exploit the result (3.180) in two different ways. Either we can fix K and 
determine the value of V from the data, which gives rise to the K-nearest-neighbour
technique discussed shortly, or we can fix V and determine K from the data, giving
rise to the kernel approach. It can be shown that both the K-nearest-neighbour
density estimator and the kernel density estimator converge to the true probability
density in the limit $N \to \infty$ provided that V shrinks with N and that K grows with
N, at an appropriate rate (Duda and Hart, 1973). 

We begin by discussing the kernel method in detail. To start with we take the 
region R to be a small hypercube centred on the point x at which we wish to
determine the probability density. To count the number K of points falling within this
region, it is convenient to define the following function:

$$k(\mathbf{u}) = \begin{cases} 1, & |u_i| \leqslant 1/2, \quad i = 1, \ldots, D, \\ 0, & \text{otherwise,} \end{cases} \tag{3.181}$$


which represents a unit cube centred on the origin. The function k(u) is an example 
of a kernel function, and in this context, it is also called a Parzen window. From 
(3.181), the quantity $k((\mathbf{x} - \mathbf{x}_n)/h)$ will be 1 if the data point $\mathbf{x}_n$ lies inside a cube
of side h centred on x, and zero otherwise. The total number of data points lying
inside this cube will therefore be

$$K = \sum_{n=1}^{N} k\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right). \tag{3.182}$$


Substituting this expression into (3.180) then gives the following result for the
estimated density at x:

$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^{D}}\, k\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right) \tag{3.183}$$

where we have used $V = h^{D}$ for the volume of a hypercube of side h in D
dimensions. Using the symmetry of the function k(u), we can now reinterpret this
equation, not as a single cube centred on x but as the sum over N cubes centred on
the N data points $\mathbf{x}_n$.
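
The hypercube (Parzen window) estimator of (3.181) to (3.183) can be sketched directly in Python. The function names `unit_cube_kernel` and `parzen_density`, together with the four two-dimensional data points, are assumptions introduced only for this illustration.

```python
def unit_cube_kernel(u):
    """k(u) = 1 if |u_i| <= 1/2 for every component, else 0, as in (3.181)."""
    return 1.0 if all(abs(u_i) <= 0.5 for u_i in u) else 0.0


def parzen_density(x, data, h):
    """p(x) = (1/N) * sum_n h^(-D) * k((x - x_n) / h), as in (3.183)."""
    dims = len(x)
    count = 0.0  # this is K, the number of points inside the cube of side h
    for x_n in data:
        u = [(x_i - x_ni) / h for x_i, x_ni in zip(x, x_n)]
        count += unit_cube_kernel(u)
    return count / (len(data) * h ** dims)


# Toy two-dimensional data, assumed for illustration.
data = [(0.0, 0.0), (0.1, 0.1), (0.4, 0.0), (1.0, 1.0)]
density = parzen_density((0.0, 0.0), data, h=0.5)
print(density)  # two points fall inside the cube, so K / (N h^D) = 2 / (4 * 0.25)
```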

Figure 3.14 Illustration of the kernel density model (3.184) applied to the same data set used to demonstrate the histogram approach in Figure 3.13. We see that h acts as a smoothing parameter and that if it is set too small (top panel, h = 0.005), the result is a very noisy density model, whereas if it is set too large (bottom panel, h = 0.2), then the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out. The best density model is obtained for some intermediate value of h (middle panel).

As it stands, the kernel density estimator (3.183) will suffer from one of the same
problems that the histogram method suffered from, namely the presence of artificial
discontinuities, in this case at the boundaries of the cubes. We can obtain a smoother
density model if we choose a smoother kernel function, and a common choice is the
Gaussian, which gives rise to the following kernel density model:

$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^{2})^{D/2}} \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_n\|^{2}}{2h^{2}} \right\} \tag{3.184}$$


where h represents the standard deviation of the Gaussian components. Thus, our 
density model is obtained by placing a Gaussian over each data point, adding up the 
contributions over the whole data set, and then dividing by N so that the density 
is correctly normalized. In Figure 3.14, we apply the model (3.184) to the data 
set used earlier to demonstrate the histogram technique. We see that, as expected, 
the parameter h plays the role of a smoothing parameter, and there is a trade-off 
between sensitivity to noise at small h and over-smoothing at large h. Again, the 
optimization of h is a problem in model complexity, analogous to the choice of bin 
width in histogram density estimation or the degree of the polynomial used in curve 
fitting. 
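
A minimal Python sketch of the Gaussian kernel density model (3.184) is given below. The function name `gaussian_kde`, the three one-dimensional data points, and the value h = 0.1 are assumptions made for illustration; the final lines check numerically, with a simple Riemann sum, that the estimate integrates to approximately one.

```python
import math


def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at x, following (3.184)."""
    dims = len(x)
    norm = (2.0 * math.pi * h * h) ** (dims / 2.0)
    total = 0.0
    for x_n in data:
        sq_dist = sum((x_i - x_ni) ** 2 for x_i, x_ni in zip(x, x_n))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (len(data) * norm)


# Toy one-dimensional data, assumed for illustration.
data = [(0.2,), (0.5,), (0.55,)]
h = 0.1

# Riemann sum over [-1, 2], which extends many standard deviations past the data.
step = 0.001
total_mass = step * sum(
    gaussian_kde((-1.0 + i * step,), data, h) for i in range(3001)
)
print(gaussian_kde((0.5,), data, h), total_mass)
```

Each evaluation loops over all N data points, which reflects the linear cost in the size of the data set that kernel methods incur at prediction time.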

We can choose any other kernel function k(u) in (3.183) subject to the
conditions

$$k(\mathbf{u}) \geqslant 0, \tag{3.185}$$
$$\int k(\mathbf{u})\,\mathrm{d}\mathbf{u} = 1, \tag{3.186}$$


which ensure that the resulting probability distribution is non-negative everywhere 
and integrates to one. The class of density model given by (3.183) is called a kernel 
density estimator or Parzen estimator. It has a great merit that there is no computation 
involved in the ‘training’ phase because this simply requires the training set to be 
stored. However, this is also one of its great weaknesses because the computational 
cost of evaluating the density grows linearly with the size of the data set. 


Figure 3.15 Illustration of K-nearest-neighbour density estimation using the same data set as in Figures 3.14 and 3.13. We see that the parameter K governs the degree of smoothing, so that a small value of K leads to a very noisy density model (top panel), whereas a large value (bottom panel) smooths out the bimodal nature of the true distribution (shown by the green curve) from which the data set was generated. Panels from top to bottom: K = 1, K = 5, K = 30.


3.5.3 Nearest-neighbours 


One of the difficulties with the kernel approach to density estimation is that the 
parameter h governing the kernel width is fixed for all kernels. In regions of high 
data density, a large value of h may lead to over-smoothing and a washing out of 
structure that might otherwise be extracted from the data. However, reducing h may 
lead to noisy estimates elsewhere in the data space where the density is smaller. 
Thus, the optimal choice for h may be dependent on the location within the data 
space. This issue is addressed by nearest-neighbour methods for density estimation. 

We therefore return to our general result (3.180) for local density estimation, 
and instead of fixing V and determining the value of K from the data, we consider 
a fixed value of K and use the data to find an appropriate value for V. To do this, 
we consider a small sphere centred on the point x at which we wish to estimate the 
density p(x), and we allow the radius of the sphere to grow until it contains precisely 
K data points. The estimate of the density p(x) is then given by (3.180) with V 
set to the volume of the resulting sphere. This technique is known as K nearest 
neighbours and is illustrated in Figure 3.15 for various choices of the parameter K 
using the same data set as used in Figures 3.13 and 3.14. We see that the value of K 
now governs the degree of smoothing and that again there is an optimum choice for 
K that is neither too large nor too small. Note that the model produced by K nearest
neighbours is not a true density model because the integral over all space diverges
(see Exercise 3.38).
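
The K-nearest-neighbour density estimate can be sketched in Python as follows, using (3.180) with V set to the volume of the smallest sphere centred on x that contains K data points. The function names, the toy data, and the use of the standard formula $\pi^{D/2} r^{D} / \Gamma(D/2 + 1)$ for the volume of a D-dimensional sphere are assumptions of this sketch.

```python
import math


def sphere_volume(radius, dims):
    """Volume of a D-dimensional sphere: pi^(D/2) / Gamma(D/2 + 1) * r^D."""
    return math.pi ** (dims / 2.0) / math.gamma(dims / 2.0 + 1.0) * radius ** dims


def knn_density(x, data, k):
    """K-nearest-neighbour estimate p(x) = K / (N V), following (3.180)."""
    distances = sorted(math.dist(x, x_n) for x_n in data)
    radius = distances[k - 1]  # radius that just encloses K data points
    # Note: the radius is zero if x coincides with K or more data points,
    # in which case the estimate is undefined (division by zero).
    return k / (len(data) * sphere_volume(radius, len(x)))


# Toy one-dimensional data, assumed for illustration.
data = [(0.0,), (1.0,), (2.0,), (4.0,)]
density = knn_density((0.5,), data, k=2)
print(density)  # the interval of radius 0.5 has length 1, so 2 / (4 * 1)
```

Far from all the data, the radius grows roughly in proportion to the distance from the data, so in one dimension the estimate decays only like the reciprocal of that distance, which is why its integral over all space diverges.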

We close this chapter by showing how the K-nearest-neighbour technique for
density estimation can be extended to the problem of classification. To do this, we
apply the K-nearest-neighbour density estimation technique to each class separately
and then make use of Bayes’ theorem. Let us suppose that we have a data set
comprising $N_k$ points in class $\mathcal{C}_k$ with N points in total, so that $\sum_k N_k = N$. If we
wish to classify a new point x, we draw a sphere centred on x containing precisely
K points irrespective of their class. Suppose this sphere has volume V and contains
$K_k$ points from class $\mathcal{C}_k$. Then (3.180) provides an estimate of the density associated
with each class:

$$p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V}. \tag{3.187}$$

Similarly, the unconditional density is given by

$$p(\mathbf{x}) = \frac{K}{NV} \tag{3.188}$$

and the class priors are given by

$$p(\mathcal{C}_k) = \frac{N_k}{N}. \tag{3.189}$$

We can now combine (3.187), (3.188), and (3.189) using Bayes’ theorem to obtain
the posterior probability of class membership:

$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}. \tag{3.190}$$

Figure 3.16 (a) In the K-nearest-neighbour classifier, a new point, shown by the black diamond, is classified according to the majority class membership of the K closest training data points, in this case K = 3. (b) In the nearest-neighbour (K = 1) approach to classification, the resulting decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes.


We can minimize the probability of misclassification by assigning the test point x to 
the class having the largest posterior probability, corresponding to the largest value
of $K_k/K$. Thus, to classify a new point, we identify the K nearest points from the
training data set and then assign the new point to the class having the largest number 
of representatives amongst this set. Ties can be broken at random. The particular 
case of K = 1 is called the nearest-neighbour rule, because a test point is simply 
assigned to the same class as the nearest point from the training set. These concepts 
are illustrated in Figure 3.16. 
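
The classification rule based on (3.190) can be sketched in Python as below. The function names `knn_posteriors` and `knn_classify`, the class labels, and the six training points are assumptions made for illustration. One simplification should be noted: ties are resolved here deterministically, in favour of the class encountered first among the neighbours, rather than at random.

```python
import math
from collections import Counter


def knn_posteriors(x, points, labels, k):
    """Posterior estimates p(C_k | x) = K_k / K over the K nearest points, as in (3.190)."""
    order = sorted(range(len(points)), key=lambda n: math.dist(x, points[n]))
    counts = Counter(labels[n] for n in order[:k])
    return {label: count / k for label, count in counts.items()}


def knn_classify(x, points, labels, k):
    """Assign x to the class with the largest estimated posterior probability."""
    posteriors = knn_posteriors(x, points, labels, k)
    # Simplification: max() keeps the first maximal class instead of a random one.
    return max(posteriors, key=posteriors.get)


# Toy two-class training set, assumed for illustration.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((0.5, 0.5), points, labels, k=3))
print(knn_classify((4.0, 4.0), points, labels, k=3))
print(knn_posteriors((0.5, 0.5), points, labels, k=5))
```

Sorting all N distances for every query is the exhaustive search mentioned in the text; it keeps the sketch short but costs O(N log N) per prediction.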

An interesting property of the nearest-neighbour (K = 1) classifier is that, in the 
limit $N \to \infty$, the error rate is never more than twice the minimum achievable error
rate of an optimal classifier, i.e., one that uses the true class distributions (Cover and
Hart, 1967).

As discussed so far, both the K-nearest-neighbour method and the kernel density
estimator require the entire training data set to be stored, leading to expensive
computation if the data set is large. This effect can be offset, at the expense of some
additional one-off computation, by constructing tree-based search structures to allow 
(approximate) near neighbours to be found efficiently without doing an exhaustive 
search of the data set. Nevertheless, these nonparametric methods are still severely 
limited. On the other hand, we have seen that simple parametric models are very 
restricted in terms of the forms of distribution that they can represent. We therefore 
need to find density models that are very flexible and yet for which the complexity 
of the models can be controlled independently of the size of the training set, and this 
can be achieved using deep neural networks. 


Exercises

3.1 (★) Verify that the Bernoulli distribution (3.2) satisfies the following properties:

$$\sum_{x=0}^{1} p(x \mid \mu) = 1 \tag{3.191}$$
$$\mathbb{E}[x] = \mu \tag{3.192}$$
$$\operatorname{var}[x] = \mu(1 - \mu). \tag{3.193}$$

Show that the entropy H[x] of a Bernoulli-distributed random binary variable x is
given by

$$\mathrm{H}[x] = -\mu \ln \mu - (1 - \mu) \ln(1 - \mu). \tag{3.194}$$


3.2 (★★) The form of the Bernoulli distribution given by (3.2) is not symmetric between
the two values of x. In some situations, it will be more convenient to use an
equivalent formulation for which $x \in \{-1, 1\}$, in which case the distribution can be written

$$p(x \mid \mu) = \left(\frac{1 - \mu}{2}\right)^{(1-x)/2} \left(\frac{1 + \mu}{2}\right)^{(1+x)/2} \tag{3.195}$$

where $\mu \in [-1, 1]$. Show that the distribution (3.195) is normalized, and evaluate its
mean, variance, and entropy.


3.3 (★★) In this exercise, we prove that the binomial distribution (3.9) is normalized.
First, use the definition (3.10) of the number of combinations of m identical objects
chosen from a total of N to show that

$$\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m}. \tag{3.196}$$

Use this result to prove by induction the following result:

$$(1 + x)^{N} = \sum_{m=0}^{N} \binom{N}{m} x^{m} \tag{3.197}$$

which is known as the binomial theorem and which is valid for all real values of x.
Finally, show that the binomial distribution is normalized, so that

$$\sum_{m=0}^{N} \binom{N}{m} \mu^{m} (1 - \mu)^{N-m} = 1, \tag{3.198}$$

which can be done by first pulling a factor $(1 - \mu)^{N}$ out of the summation and then
making use of the binomial theorem.


3.4 (★★) Show that the mean of the binomial distribution is given by (3.11). To do this,
differentiate both sides of the normalization condition (3.198) with respect to $\mu$ and
then rearrange to obtain an expression for the mean of n. Similarly, by differentiating
(3.198) twice with respect to $\mu$ and making use of the result (3.11) for the mean of
the binomial distribution, prove the result (3.12) for the variance of the binomial.


3.5 (★) Show that the mode of the multivariate Gaussian (3.26) is given by $\boldsymbol{\mu}$.


3.6 (★★) Suppose that x has a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
Show that the linearly transformed variable Ax + b is also Gaussian, and find its
mean and covariance.


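3.7 (★★★) Show that the Kullback–Leibler divergence between two Gaussian
distributions $q(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$ and $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)$ is given by

$$\mathrm{KL}\left(q(\mathbf{x}) \,\|\, p(\mathbf{x})\right) = \frac{1}{2} \left\{ \ln \frac{|\boldsymbol{\Sigma}_p|}{|\boldsymbol{\Sigma}_q|} - D + \operatorname{Tr}\left(\boldsymbol{\Sigma}_p^{-1} \boldsymbol{\Sigma}_q\right) + (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^{\mathrm{T}} \boldsymbol{\Sigma}_p^{-1} (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q) \right\} \tag{3.199}$$

where Tr(·) denotes the trace of a matrix, and D is the dimensionality of x.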


3.8 (★★) This exercise demonstrates that the multivariate distribution with maximum
entropy, for a given covariance, is a Gaussian. The entropy of a distribution p(x) is
given by

$$\mathrm{H}[\mathbf{x}] = -\int p(\mathbf{x}) \ln p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{3.200}$$

We wish to maximize H[x] over all distributions p(x) subject to the constraints that
p(x) is normalized and that it has a specific mean and covariance, so that

$$\int p(\mathbf{x})\,\mathrm{d}\mathbf{x} = 1 \tag{3.201}$$
$$\int p(\mathbf{x})\,\mathbf{x}\,\mathrm{d}\mathbf{x} = \boldsymbol{\mu} \tag{3.202}$$
$$\int p(\mathbf{x})(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}}\,\mathrm{d}\mathbf{x} = \boldsymbol{\Sigma}. \tag{3.203}$$

By performing a variational maximization of (3.200) and using Lagrange multipliers
to enforce the constraints (3.201), (3.202), and (3.203), show that the maximum
entropy distribution is given by the Gaussian (3.26).


3.9 (★★★) Show that the entropy of the multivariate Gaussian $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is given by

$$\mathrm{H}[\mathbf{x}] = \frac{1}{2} \ln |\boldsymbol{\Sigma}| + \frac{D}{2} \left(1 + \ln(2\pi)\right) \tag{3.204}$$


where D is the dimensionality of x. 


3.10 (★★★) Consider two random variables $x_1$ and $x_2$ having Gaussian distributions with
means $\mu_1$ and $\mu_2$ and precisions $\tau_1$ and $\tau_2$, respectively. Derive an expression for the
differential entropy of the variable $x = x_1 + x_2$. To do this, first find the distribution
of x by using the relation

$$p(x) = \int p(x \mid x_2)\, p(x_2)\,\mathrm{d}x_2 \tag{3.205}$$

and completing the square in the exponent. Then observe that this represents the
convolution of two Gaussian distributions, which itself will be Gaussian, and finally
make use of the result (2.99) for the entropy of the univariate Gaussian.


3.11 (★) Consider the multivariate Gaussian distribution given by (3.26). By writing the
precision matrix (inverse covariance matrix) as the sum of a symmetric and an anti- 
symmetric matrix, show that the antisymmetric term does not appear in the exponent 
of the Gaussian, and hence, that the precision matrix may be taken to be symmetric 
without loss of generality. Because the inverse of a symmetric matrix is also sym- 
metric (see Exercise 3.16), it follows that the covariance matrix may also be chosen 
to be symmetric without loss of generality. 


3.12 (★★★) Consider a real, symmetric matrix $\boldsymbol{\Sigma}$ whose eigenvalue equation is given by
(3.28). By taking the complex conjugate of this equation, subtracting the original
equation, and then forming the inner product with eigenvector $\mathbf{u}_i$, show that the
eigenvalues $\lambda_i$ are real. Similarly, use the symmetry property of $\boldsymbol{\Sigma}$ to show that two
eigenvectors $\mathbf{u}_i$ and $\mathbf{u}_j$ will be orthogonal provided $\lambda_j \neq \lambda_i$. Finally, show that,
without loss of generality, the set of eigenvectors can be chosen to be orthonormal,
so that they satisfy (3.29), even if some of the eigenvalues are zero.


3.13 (★★) Show that a real, symmetric matrix $\boldsymbol{\Sigma}$ having the eigenvector equation (3.28)
can be expressed as an expansion in the eigenvectors, with coefficients given by the
eigenvalues, of the form (3.31). Similarly, show that the inverse matrix $\boldsymbol{\Sigma}^{-1}$ has a
representation of the form (3.32).


3.14 (★★) A positive definite matrix $\boldsymbol{\Sigma}$ can be defined as one for which the quadratic form

$$\mathbf{a}^{\mathrm{T}} \boldsymbol{\Sigma}\, \mathbf{a} \tag{3.206}$$

is positive for any real value of the vector a. Show that a necessary and sufficient
condition for $\boldsymbol{\Sigma}$ to be positive definite is that all the eigenvalues $\lambda_i$ of $\boldsymbol{\Sigma}$, defined by
(3.28), are positive.


3.15 (★) Show that a real, symmetric matrix of size $D \times D$ has $D(D + 1)/2$ independent
parameters.


3.16 (★) Show that the inverse of a symmetric matrix is itself symmetric.


3.17 (★★) By diagonalizing the coordinate system using the eigenvector expansion (3.31),
show that the volume contained within the hyperellipsoid corresponding to a constant
Mahalanobis distance $\Delta$ is given by

$$V_D\, |\boldsymbol{\Sigma}|^{1/2} \Delta^{D} \tag{3.207}$$

where $V_D$ is the volume of the unit sphere in D dimensions, and the Mahalanobis
distance is defined by (3.27).


3.18 (★★) Prove the identity (3.60) by multiplying both sides by the matrix

$$\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix} \tag{3.208}$$

and making use of the definition (3.61).


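3.19 (★★★) In Sections 3.2.4 and 3.2.5, we considered the conditional and marginal
distributions for a multivariate Gaussian. More generally, we can consider a partitioning
of the components of x into three groups $\mathbf{x}_a$, $\mathbf{x}_b$, and $\mathbf{x}_c$, with a corresponding
partitioning of the mean vector $\boldsymbol{\mu}$ and of the covariance matrix $\boldsymbol{\Sigma}$ in the form

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \\ \boldsymbol{\mu}_c \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} & \boldsymbol{\Sigma}_{ac} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} & \boldsymbol{\Sigma}_{bc} \\ \boldsymbol{\Sigma}_{ca} & \boldsymbol{\Sigma}_{cb} & \boldsymbol{\Sigma}_{cc} \end{pmatrix}. \tag{3.209}$$

By making use of the results of Section 3.2, find an expression for the conditional
distribution $p(\mathbf{x}_a \mid \mathbf{x}_b)$ in which $\mathbf{x}_c$ has been marginalized out.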


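3.20 (★★) A very useful result from linear algebra is the Woodbury matrix inversion
formula given by

$$(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1}. \tag{3.210}$$

By multiplying both sides by (A + BCD), prove the correctness of this result.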


3.21 (★) Let x and z be two independent random vectors, so that p(x, z) = p(x)p(z).
Show that the mean of their sum y = x + z is given by the sum of the means of each
of the variables separately. Similarly, show that the covariance matrix of y is given
by the sum of the covariance matrices of x and z.


3.22 (★★★) Consider a joint distribution over the variable

$$\mathbf{z} = \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \tag{3.211}$$

whose mean and covariance are given by (3.92) and (3.89), respectively. By making
use of the results (3.76) and (3.77), show that the marginal distribution p(x) is given
by (3.83). Similarly, by making use of the results (3.65) and (3.66), show that the
conditional distribution p(y|x) is given by (3.84).


3.23 (★★) Using the partitioned matrix inversion formula (3.60), show that the inverse of
the precision matrix (3.88) is given by the covariance matrix (3.89).


3.24 (★★) By starting from (3.91) and making use of the result (3.89), verify the result
(3.92). 


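3.25 (★★) Consider two multi-dimensional random vectors x and z having Gaussian
distributions $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z)$, respectively, together
with their sum y = x + z. By considering the linear-Gaussian model comprising
the product of the marginal distribution p(x) and the conditional distribution p(y|x)
and making use of the results (3.93) and (3.94), show that the marginal distribution
p(y) is given by

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_x + \boldsymbol{\mu}_z, \boldsymbol{\Sigma}_x + \boldsymbol{\Sigma}_z). \tag{3.212}$$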


3.26 (★★★) This exercise and the next provide practice at manipulating the quadratic
forms that arise in linear-Gaussian models, and they also serve as an independent 
check of results derived in the main text. Consider a joint distribution p(x, y) de- 
fined by the marginal and conditional distributions given by (3.83) and (3.84). By 
examining the quadratic form in the exponent of the joint distribution and using the 
technique of ‘completing the square’ discussed in Section 3.2, find expressions for 
the mean and covariance of the marginal distribution p(y) in which the variable x 
has been integrated out. To do this, make use of the Woodbury matrix inversion 
formula (3.210). Verify that these results agree with (3.93) and (3.94). 


3.27 (★★★) Consider the same joint distribution as in Exercise 3.26, but now use the
technique of completing the square to find expressions for the mean and covariance of
the conditional distribution p(x|y). Again, verify that these agree with the corre- 
sponding expressions (3.95) and (3.96). 


3.28 (★★) To find the maximum likelihood solution for the covariance matrix of a
multivariate Gaussian, we need to maximize the log likelihood function (3.102) with
respect to $\boldsymbol{\Sigma}$, noting that the covariance matrix must be symmetric and positive
definite. Here we proceed by ignoring these constraints and doing a straightforward
maximization. Using the results (A.21), (A.26), and (A.28) from Appendix A, show
that the covariance matrix $\boldsymbol{\Sigma}$ that maximizes the log likelihood function (3.102) is
given by the sample covariance (3.106). We note that the final result is necessarily
symmetric and positive definite (provided the sample covariance is non-singular).


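3.29 (★★) Use the result (3.42) to prove (3.46). Now, using the results (3.42) and (3.46),
show that

$$\mathbb{E}[\mathbf{x}_n \mathbf{x}_m^{\mathrm{T}}] = \boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}} + I_{nm} \boldsymbol{\Sigma} \tag{3.213}$$

where $\mathbf{x}_n$ denotes a data point sampled from a Gaussian distribution with mean $\boldsymbol{\mu}$
and covariance $\boldsymbol{\Sigma}$, and $I_{nm}$ denotes the (n, m) element of the identity matrix. Hence,
prove the result (3.108).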


3.30 (★) The various trigonometric identities used in the discussion of periodic variables
in this chapter can be proven easily from the relation

$$\exp(iA) = \cos A + i \sin A \tag{3.214}$$

in which i is the square root of minus one. By considering the identity

$$\exp(iA) \exp(-iA) = 1 \tag{3.215}$$

prove the result (3.127). Similarly, using the identity

$$\cos(A - B) = \Re \exp\{i(A - B)\} \tag{3.216}$$

where $\Re$ denotes the real part, prove (3.128). Finally, by using $\sin(A - B) = \Im \exp\{i(A - B)\}$, where $\Im$ denotes the imaginary part, prove the result (3.133).


3.31 (★★) For large m, the von Mises distribution (3.129) becomes sharply peaked around
the mode $\theta_0$. By defining $\xi = m^{1/2}(\theta - \theta_0)$ and taking the Taylor expansion of the
cosine function given by

$$\cos \alpha = 1 - \frac{\alpha^{2}}{2} + O(\alpha^{4}) \tag{3.217}$$

show that as $m \to \infty$, the von Mises distribution tends to a Gaussian.


3.32 (★) Using the trigonometric identity (3.133), show that the solution of (3.132) for $\theta_0$ is
given by (3.134).


3.33 (★) By computing the first and second derivatives of the von Mises distribution
(3.129), and using $I_0(m) > 0$ for $m > 0$, show that the maximum of the distribution
occurs when $\theta = \theta_0$ and that the minimum occurs when $\theta = \theta_0 + \pi \pmod{2\pi}$.


3.34 (★) By making use of the result (3.118) together with (3.134) and the trigonometric
identity (3.128), show that the maximum likelihood solution $m_{\mathrm{ML}}$ for the concentration
of the von Mises distribution satisfies $A(m_{\mathrm{ML}}) = \bar{r}$ where $\bar{r}$ is the radius of the
mean of the observations viewed as unit vectors in the two-dimensional Euclidean
plane, as illustrated in Figure 3.9.


3.35 (★) Verify that the multivariate Gaussian distribution can be cast in exponential
family form (3.138), and derive expressions for $\boldsymbol{\eta}$, u(x), h(x), and $g(\boldsymbol{\eta})$ analogous to
(3.164) to (3.167).


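3.36 (★) The result (3.172) showed that the negative gradient of $\ln g(\boldsymbol{\eta})$ for the exponential
family is given by the expectation of u(x). By taking the second derivatives of
(3.139), show that

$$-\nabla\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^{\mathrm{T}}] - \mathbb{E}[\mathbf{u}(\mathbf{x})]\,\mathbb{E}[\mathbf{u}(\mathbf{x})^{\mathrm{T}}] = \operatorname{cov}[\mathbf{u}(\mathbf{x})]. \tag{3.218}$$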


3.37 (★★) Consider a histogram-like density model in which the space x is divided into
fixed regions for which the density p(x) takes the constant value $h_i$ over the ith
region. The volume of region i is denoted $\Delta_i$. Suppose we have a set of N observations
of x such that $n_i$ of these observations fall in region i. Using a Lagrange multiplier
to enforce the normalization constraint on the density, derive an expression for the
maximum likelihood estimator for the $\{h_i\}$.


3.38 (★) Show that the K-nearest-neighbour density model defines an improper
distribution whose integral over all space is divergent.


4. Single-layer Networks: Regression


In this chapter we discuss some of the basic ideas behind neural networks using the
framework of linear regression, which we encountered briefly in the context of
polynomial curve fitting (Section 1.2). We will see that a linear regression model corresponds to a
simple form of neural network having a single layer of learnable parameters. Although
single-layer networks have very limited practical applicability, they have simple
analytical properties and provide an excellent framework for introducing many of the
core concepts that will lay a foundation for our discussion of deep neural networks
in later chapters.


4.1. Linear Regression 


The goal of regression is to predict the value of one or more continuous target vari- 
ables t given the value of a D-dimensional vector x of input variables. Typically we 
are given a training data set comprising N observations $\{\mathbf{x}_n\}$, where $n = 1, \ldots, N$,
together with corresponding target values $\{t_n\}$, and the goal is to predict the value of
t for a new value of x. To do this, we formulate a function y(x, w) whose values for 
new inputs x constitute the predictions for the corresponding values of t, and where 
w represents a vector of parameters that can be learned from the training data. 

The simplest model for regression is one that involves a linear combination of 
the input variables: 


$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D \tag{4.1}$$


where $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$. The term linear regression sometimes refers specifically
to this form of model. The key property of this model is that it is a linear function
of the parameters $w_0, \ldots, w_D$. It is also, however, a linear function of the input
variables $x_i$, and this imposes significant limitations on the model.


4.1.1 Basis functions 


We can extend the class of models defined by (4.1) by considering linear com- 
binations of fixed nonlinear functions of the input variables, of the form 


$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) \tag{4.2}$$


where $\phi_j(\mathbf{x})$ are known as basis functions. By denoting the maximum value of the
index j by M − 1, the total number of parameters in this model will be M.

The parameter $w_0$ allows for any fixed offset in the data and is sometimes called
a bias parameter (not to be confused with bias in a statistical sense). It is often
convenient to define an additional dummy basis function $\phi_0(\mathbf{x})$ whose value is fixed
at $\phi_0(\mathbf{x}) = 1$ so that (4.2) becomes

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{4.3}$$

where $\mathbf{w} = (w_0, \ldots, w_{M-1})^{\mathrm{T}}$ and $\boldsymbol{\phi} = (\phi_0, \ldots, \phi_{M-1})^{\mathrm{T}}$. We can represent the
model (4.3) using a neural network diagram, as shown in Figure 4.1.

By using nonlinear basis functions, we allow the function y(x, w) to be a non- 
linear function of the input vector x. Functions of the form (4.2) are called linear 
models, however, because they are linear in w. It is this linearity in the parameters
that will greatly simplify the analysis of this class of models. However, it also leads 
to some significant limitations. 


Figure 4.1 The linear regression model (4.3) can be expressed as a simple neural network diagram
involving a single layer of parameters. Here each basis function $\phi_j(\mathbf{x})$ is represented by
an input node, with the solid node representing the ‘bias’ basis function $\phi_0$, and the
function $y(\mathbf{x}, \mathbf{w})$ is represented by an output node. Each of the parameters $w_j$ is shown
by a line connecting the corresponding basis function to the output.


Before the advent of deep learning it was common practice in machine learning 
to use some form of fixed pre-processing of the input variables x, also known as feature
extraction, expressed in terms of a set of basis functions $\{\phi_j(\mathbf{x})\}$. The goal was
to choose a sufficiently powerful set of basis functions that the resulting learning task 
could be solved using a simple network model. Unfortunately, it is very difficult to 
hand-craft suitable basis functions for anything but the simplest applications. Deep 
learning avoids this problem by learning the required nonlinear transformations of 
the data from the data set itself. 

We have already encountered an example of a regression problem when we dis- 
cussed curve fitting using polynomials. The polynomial function (1.1) can be ex- 
pressed in the form (4.3) if we consider a single input variable x and if we choose 
basis functions defined by $\phi_j(x) = x^j$. There are many other possible choices for
the basis functions, for example 


$$ \phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2s^2} \right\} \tag{4.4} $$


where the $\mu_j$ govern the locations of the basis functions in input space, and the
parameter $s$ governs their spatial scale. These are usually referred to as ‘Gaussian’
basis functions, although it should be noted that they are not required to have a
probabilistic interpretation. In particular the normalization coefficient is unimportant
because these basis functions will be multiplied by learnable parameters $w_j$.

Another possibility is the sigmoidal basis function of the form


$$ \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right) \tag{4.5} $$


where $\sigma(a)$ is the logistic sigmoid function defined by

$$ \sigma(a) = \frac{1}{1 + \exp(-a)}. \tag{4.6} $$

Equivalently, we can use the tanh function because this is related to the logistic
sigmoid by $\tanh(a) = 2\sigma(2a) - 1$, and so a general linear combination of logistic
sigmoid functions is equivalent to a general linear combination of tanh functions in
the sense that they can represent the same class of input-output functions. These
various choices of basis function are illustrated in Figure 4.2.

Figure 4.2 Examples of basis functions, showing polynomials on the left, Gaussians of the form (4.4) in the
centre, and sigmoidal basis functions of the form (4.5) on the right.
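
As a minimal sketch of the three families of basis function discussed above, the following NumPy code evaluates polynomial, Gaussian, and sigmoidal basis functions for a scalar input and checks the tanh identity numerically. The function names, the centres `mu`, and the scales `s` are illustrative assumptions and are not fixed by the text.

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns phi_j(x) = x**j for j = 0, ..., M-1 (j = 0 is the constant 'bias' basis)."""
    return np.stack([x**j for j in range(M)], axis=1)

def gaussian_basis(x, mu, s):
    """Columns phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), as in (4.4)."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2.0 * s**2))

def logistic_sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)), as in (4.6)."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_basis(x, mu, s):
    """Columns phi_j(x) = sigma((x - mu_j) / s), as in (4.5)."""
    return logistic_sigmoid((x[:, None] - mu[None, :]) / s)

x = np.linspace(-1.0, 1.0, 5)            # five illustrative input values
mu = np.linspace(-1.0, 1.0, 3)           # assumed centres
Phi_poly = polynomial_basis(x, M=4)      # shape (5, 4)
Phi_gauss = gaussian_basis(x, mu, s=0.2)
Phi_sig = sigmoidal_basis(x, mu, s=0.1)

# Check the identity tanh(a) = 2 * sigma(2a) - 1 on a few points.
a = np.linspace(-3.0, 3.0, 7)
tanh_gap = np.max(np.abs(np.tanh(a) - (2.0 * logistic_sigmoid(2.0 * a) - 1.0)))
```

Each returned array has one row per input value and one column per basis function, which is the layout of a design matrix for a linear model of this kind.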


Section 4.1.7 


Section 1.2 


Yet another possible choice of basis function is the Fourier basis, which leads to 
an expansion in sinusoidal functions. Each basis function represents a specific fre- 
quency and has infinite spatial extent. By contrast, basis functions that are localized 
to finite regions of input space necessarily comprise a spectrum of different spatial 
frequencies. In signal processing applications, it is often of interest to consider basis 
functions that are localized in both space and frequency, leading to a class of func- 
tions known as wavelets (Ogden, 1997; Mallat, 1999; Vidakovic, 1999). These are 
also defined to be mutually orthogonal, to simplify their application. Wavelets are 
most applicable when the input values live on a regular lattice, such as the successive 
time points in a temporal sequence or the pixels in an image. 

Most of the discussion in this chapter, however, is independent of the choice of 
basis function set, and so we will not specify the particular form of the basis func- 
tions, except for numerical illustration. Furthermore, to keep the notation simple, we 
will focus on the case of a single target variable t, although we will briefly outline 
the modifications needed to deal with multiple target variables. 


4.1.2 Likelihood function 


We solved the problem of fitting a polynomial function to data by minimizing 
a sum-of-squares error function, and we also showed that this error function could 
be motivated as the maximum likelihood solution under an assumed Gaussian noise 
model. We now return to this discussion and consider the least-squares approach, 
and its relation to maximum likelihood, in more detail. 

As before, we assume that the target variable t is given by a deterministic func- 
tion y(x, w) with additive Gaussian noise so that 


$$ t = y(\mathbf{x}, \mathbf{w}) + \epsilon \tag{4.7} $$

where $\epsilon$ is a zero-mean Gaussian random variable with variance $\sigma^2$. Thus, we can
write

$$ p(t \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \sigma^2). \tag{4.8} $$

Now consider a data set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ with corresponding target
values $t_1, \ldots, t_N$. We group the target variables $\{t_n\}$ into a column vector that we
denote by $\mathsf{t}$, where the typeface is chosen to distinguish it from a single observation
of a multivariate target, which would be denoted $\mathbf{t}$. Making the assumption that these
data points are drawn independently from the distribution (4.8), we obtain an expression
for the likelihood function, which is a function of the adjustable parameters $\mathbf{w}$
and $\sigma^2$:


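$$ p(\mathsf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2) \tag{4.9} $$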


where we have used (4.3). Taking the logarithm of the likelihood function and mak- 
ing use of the standard form (2.49) for the univariate Gaussian, we have 


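$$ \begin{aligned} \ln p(\mathsf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) &= \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2) \\ &= -\frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi) - \frac{1}{\sigma^2} E_D(\mathbf{w}) \end{aligned} \tag{4.10} $$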


where the sum-of-squares error function is defined by 


$$ E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \}^2. \tag{4.11} $$


The first two terms in (4.10) can be treated as constants when determining w be- 
cause they are independent of w. Therefore, as we saw previously, maximizing the 
likelihood function under a Gaussian noise distribution is equivalent to minimizing 
the sum-of-squares error function (4.11). 
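
The relation between the log likelihood and the sum-of-squares error can be checked numerically. The sketch below uses synthetic data and an assumed quadratic basis (neither comes from the text); it evaluates the log likelihood both as a direct sum of Gaussian log densities and through the closed form in (4.10), and confirms that, for fixed variance, a smaller error corresponds to a larger likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 50, 0.09
x = rng.uniform(0.0, 1.0, size=N)
Phi = np.stack([np.ones(N), x, x**2], axis=1)        # assumed basis: 1, x, x^2
w = np.array([0.5, -1.0, 2.0])                        # arbitrary parameter vector
t = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), N)     # targets generated as in (4.7)

def log_likelihood_direct(w, sigma2):
    """Sum of univariate Gaussian log densities (first line of (4.10))."""
    r = t - Phi @ w
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2) - r**2 / (2.0 * sigma2))

def sum_of_squares_error(w):
    """E_D(w) from (4.11)."""
    return 0.5 * np.sum((t - Phi @ w) ** 2)

def log_likelihood_formula(w, sigma2):
    """Closed form (second line of (4.10))."""
    return (-0.5 * N * np.log(sigma2) - 0.5 * N * np.log(2.0 * np.pi)
            - sum_of_squares_error(w) / sigma2)

w_other = np.array([0.0, 0.0, 1.0])                   # a second, arbitrary candidate
ll_gap = abs(log_likelihood_direct(w, sigma2) - log_likelihood_formula(w, sigma2))

# For fixed sigma2, ranking by likelihood is the reverse of ranking by E_D.
higher_likelihood = log_likelihood_direct(w, sigma2) > log_likelihood_direct(w_other, sigma2)
lower_error = sum_of_squares_error(w) < sum_of_squares_error(w_other)
```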


4.1.3 Maximum likelihood 


Having written down the likelihood function, we can use maximum likelihood 
to determine $\mathbf{w}$ and $\sigma^2$. Consider first the maximization with respect to $\mathbf{w}$. The
gradient of the log likelihood function (4.10) with respect to w takes the form 


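$$ \nabla_{\mathbf{w}} \ln p(\mathsf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \frac{1}{\sigma^2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \} \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}. \tag{4.12} $$
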
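Setting this gradient to zero gives

$$ 0 = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} - \mathbf{w}^{\mathrm{T}} \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} \right). \tag{4.13} $$

Solving for $\mathbf{w}$ we obtain

$$ \mathbf{w}_{\mathrm{ML}} = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathsf{t}, \tag{4.14} $$

which are known as the normal equations for the least-squares problem. Here $\boldsymbol{\Phi}$ is an
$N \times M$ matrix, called the design matrix, whose elements are given by $\Phi_{nj} = \phi_j(\mathbf{x}_n)$,
so that

$$ \boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}. \tag{4.15} $$

The quantity

$$ \boldsymbol{\Phi}^{\dagger} \equiv \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \tag{4.16} $$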


is known as the Moore-Penrose pseudo-inverse of the matrix $\boldsymbol{\Phi}$ (Rao and Mitra,
1971; Golub and Van Loan, 1996). It can be regarded as a generalization of the notion
of a matrix inverse to non-square matrices. Indeed, if $\boldsymbol{\Phi}$ is square and invertible,
then using the property $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ we see that $\boldsymbol{\Phi}^{\dagger} \equiv \boldsymbol{\Phi}^{-1}$.
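
To make the maximum likelihood solution concrete, the following sketch computes it in three ways with NumPy: by solving the normal equations as a linear system, by applying the pseudo-inverse, and by calling a least-squares solver. The synthetic data and the cubic polynomial basis are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 4
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.stack([x**j for j in range(M)], axis=1)    # N x M design matrix (polynomial basis assumed)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, N)     # synthetic targets

# Normal equations (4.14), solved as a linear system instead of forming an explicit inverse.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Moore-Penrose pseudo-inverse (4.16); NumPy computes it through the SVD.
w_pinv = np.linalg.pinv(Phi) @ t

# Least-squares solver, which works on Phi directly without forming Phi^T Phi.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# For a square invertible matrix the pseudo-inverse reduces to the ordinary inverse.
A = 2.0 * np.eye(M) + 0.1 * rng.normal(size=(M, M))
pinv_equals_inv = np.allclose(np.linalg.pinv(A), np.linalg.inv(A))
```

For a well-conditioned design matrix such as this one, the three routes should agree to numerical precision; they differ mainly in how they behave when the problem is close to degenerate.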

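At this point, we can gain some insight into the role of the bias parameter $w_0$. If
we make the bias parameter explicit, then the error function (4.11) becomes

$$ E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \Big\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n) \Big\}^2. \tag{4.17} $$

Setting the derivative with respect to $w_0$ equal to zero and solving for $w_0$, we obtain

$$ w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j \tag{4.18} $$

where we have defined

$$ \bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(\mathbf{x}_n). \tag{4.19} $$

Thus, the bias $w_0$ compensates for the difference between the averages (over the
training set) of the target values and the weighted sum of the averages of the basis
function values.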
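
The role of the bias can be verified on a small example. In this sketch (synthetic data and three Gaussian basis functions, all assumed for illustration) the fitted bias is compared with the average target minus the weighted sum of the average basis function values.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
x = rng.uniform(0.0, 1.0, size=N)
t = 3.0 + np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, N)   # targets with an offset of 3

# Design matrix with phi_0 = 1 followed by three assumed Gaussian basis functions.
mu, s = np.array([0.2, 0.5, 0.8]), 0.2
Phi = np.column_stack([np.ones(N), np.exp(-((x[:, None] - mu) ** 2) / (2 * s**2))])

w, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # least-squares weights, w[0] is the bias

t_bar = t.mean()                                   # average of the target values
phi_bar = Phi[:, 1:].mean(axis=0)                  # average of each non-constant basis function
w0_from_averages = t_bar - w[1:] @ phi_bar         # mean target minus weighted mean basis values
```

The two values `w[0]` and `w0_from_averages` should coincide up to rounding, because the least-squares residuals sum to zero whenever a constant basis function is included.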

We can also maximize the log likelihood function (4.10) with respect to the
variance $\sigma^2$, giving

$$ \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - \mathbf{w}_{\mathrm{ML}}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \}^2, \tag{4.20} $$

and so we see that the maximum likelihood value of the variance parameter is given
by the residual variance of the target values around the regression function.

Figure 4.3 Geometrical interpretation of the least-squares solution in an N-dimensional space
whose axes are the values of $t_1, \ldots, t_N$. The least-squares regression function is obtained
by finding the orthogonal projection of the data vector $\mathsf{t}$ onto the subspace spanned by
the basis functions $\phi_j(\mathbf{x})$ in which each basis function is viewed as a vector $\boldsymbol{\varphi}_j$ of length N
with elements $\phi_j(\mathbf{x}_n)$.

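
As a brief illustration of the maximum likelihood variance, the sketch below computes the mean squared residual on synthetic data (a straight-line model with an assumed noise level) and checks that the log likelihood at this value is at least as large as at a few nearby variances.

```python
import numpy as np

rng = np.random.default_rng(3)
N, true_sigma = 200, 0.3                                 # assumed noise standard deviation
x = rng.uniform(0.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])                   # assumed basis: 1, x
t = 1.0 - 2.0 * x + rng.normal(0.0, true_sigma, N)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
residuals = t - Phi @ w_ml
sigma2_ml = np.mean(residuals**2)                        # (4.20): note the factor 1/N

def log_likelihood(sigma2):
    """Log likelihood (4.10) at w_ml, viewed as a function of the variance."""
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - np.sum(residuals**2) / (2.0 * sigma2)

# The log likelihood at sigma2_ml should not be exceeded at nearby variances.
is_maximum = all(log_likelihood(sigma2_ml) >= log_likelihood(sigma2_ml * c)
                 for c in (0.5, 0.9, 1.1, 2.0))
```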


4.1.4 Geometry of least squares 


At this point, it is instructive to consider the geometrical interpretation of the 
least-squares solution. To do this, we consider an N-dimensional space whose axes 
are given by the $t_n$, so that $\mathsf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ is a vector in this space. Each basis
function $\phi_j(\mathbf{x}_n)$, evaluated at the N data points, can also be represented as a vector in
the same space, denoted by $\boldsymbol{\varphi}_j$, as illustrated in Figure 4.3. Note that $\boldsymbol{\varphi}_j$ corresponds
to the jth column of $\boldsymbol{\Phi}$, whereas $\boldsymbol{\phi}(\mathbf{x}_n)$ corresponds to the transpose of the nth row of
$\boldsymbol{\Phi}$. If the number M of basis functions is smaller than the number N of data points,
then the M vectors $\boldsymbol{\varphi}_j$ will span a linear subspace $\mathcal{S}$ of dimensionality M. We
define $\mathsf{y}$ to be an N-dimensional vector whose nth element is given by $y(\mathbf{x}_n, \mathbf{w})$,
where $n = 1, \ldots, N$. Because $\mathsf{y}$ is an arbitrary linear combination of the vectors
$\boldsymbol{\varphi}_j$, it can live anywhere in the M-dimensional subspace. The sum-of-squares error
(4.11) is then equal (up to a factor of 1/2) to the squared Euclidean distance between
$\mathsf{y}$ and $\mathsf{t}$. Thus, the least-squares solution for $\mathbf{w}$ corresponds to that choice of $\mathsf{y}$ that
lies in subspace $\mathcal{S}$ and is closest to $\mathsf{t}$. Intuitively, from Figure 4.3, we anticipate that
this solution corresponds to the orthogonal projection of $\mathsf{t}$ onto the subspace $\mathcal{S}$. This
is indeed the case, as can easily be verified by noting that the solution for $\mathsf{y}$ is given
by $\boldsymbol{\Phi}\mathbf{w}_{\mathrm{ML}}$ and then confirming that this takes the form of an orthogonal projection.
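
The projection property can be confirmed numerically. In the following sketch a random design matrix stands in for the basis vectors (an assumption made purely for illustration), and we check that the least-squares prediction equals the projection of the target vector, that the projection matrix is symmetric and idempotent, and that the residual is orthogonal to every column.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 6, 3
Phi = rng.normal(size=(N, M))       # columns play the role of the basis vectors
t = rng.normal(size=N)              # data vector in the N-dimensional space

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml                      # least-squares prediction vector, lies in the subspace S

# Projection matrix onto the column space of Phi: P = Phi (Phi^T Phi)^{-1} Phi^T.
P = Phi @ np.linalg.pinv(Phi)

residual = t - y
max_inner_product = np.max(np.abs(Phi.T @ residual))   # should vanish up to rounding
```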

In practice, a direct solution of the normal equations can lead to numerical difficulties
when $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ is close to singular. In particular, when two or more of the basis
vectors $\boldsymbol{\varphi}_j$ are co-linear, or nearly so, the resulting parameter values can have large
magnitudes. Such near degeneracies will not be uncommon when dealing with real 
data sets. The resulting numerical difficulties can be addressed using the technique 
of singular value decomposition, or SVD (Deisenroth, Faisal, and Ong, 2020). Note 
that the addition of a regularization term ensures that the matrix is non-singular, even 
in the presence of degeneracies. 
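
A small experiment illustrates the effect of near co-linearity. The sketch below builds a design matrix whose third column is almost a copy of the second (an artificial construction, assumed here for illustration) and compares the condition number of the Gram matrix with and without an added regularization term; the value of `lam` is likewise an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
x = rng.uniform(0.0, 1.0, size=N)
# Two nearly co-linear basis vectors: the third column is almost a copy of the second.
Phi = np.column_stack([np.ones(N), x, x + 1e-6 * rng.normal(size=N)])
t = 2.0 * x + rng.normal(0.0, 0.1, N)

gram = Phi.T @ Phi
cond_plain = np.linalg.cond(gram)                  # very large: gram is close to singular

lam = 1e-3                                         # assumed regularization coefficient
cond_regularized = np.linalg.cond(gram + lam * np.eye(3))

# SVD-based least squares works on Phi directly and does not form Phi^T Phi.
w_svd, *_ = np.linalg.lstsq(Phi, t, rcond=None)
# Regularized solution: adding lam * I keeps the matrix away from singularity.
w_reg = np.linalg.solve(gram + lam * np.eye(3), Phi.T @ t)
```

Comparing `w_svd` and `w_reg` on such data typically shows much larger weights on the two nearly identical columns in the unregularized fit.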


4.1.5 Sequential learning 


The maximum likelihood solution (4.14) involves processing the entire training 
set in one go and is known as a batch method. This can become computationally 
costly for large data sets. If the data set is sufficiently large, it may be worthwhile 
to use sequential algorithms, also known as online algorithms, in which the data 
points are considered one at a time and the model parameters updated after each 
such presentation. Sequential learning is also appropriate for real-time applications
in which the data observations arrive in a continuous stream and predictions must be
made before all the data points are seen.

We can obtain a sequential learning algorithm by applying the technique of 
stochastic gradient descent, also known as sequential gradient descent, as follows. If
the error function comprises a sum over data points $E = \sum_n E_n$, then after presentation
of data point $n$, the stochastic gradient descent algorithm updates the parameter
vector $\mathbf{w}$ using

$$ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n \tag{4.21} $$

where $\tau$ denotes the iteration number, and $\eta$ is a suitably chosen learning rate parameter.
The value of $\mathbf{w}$ is initialized to some starting vector $\mathbf{w}^{(0)}$. For the sum-of-squares
error function (4.11), this gives

$$ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)\mathrm{T}} \boldsymbol{\phi}_n \right) \boldsymbol{\phi}_n \tag{4.22} $$

where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$. This is known as the least-mean-squares or the LMS algorithm.
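
The LMS update can be sketched in a few lines. The example below is illustrative only: the synthetic data, the quadratic basis, the learning rate `eta`, the number of passes `epochs`, and the random visiting order are all assumptions, and the batch least-squares solution is computed solely for comparison.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 500
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x, x**2])       # assumed basis: 1, x, x^2
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true + rng.normal(0.0, 0.05, N)

def lms(Phi, t, eta=0.05, epochs=20, seed=0):
    """Least-mean-squares updates (4.22), applied to one data point at a time."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])                     # starting vector w^(0)
    for _ in range(epochs):
        for n in order_rng.permutation(len(t)):    # visit the points in random order
            phi_n = Phi[n]
            w = w + eta * (t[n] - w @ phi_n) * phi_n
    return w

w_lms = lms(Phi, t)
w_batch, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # batch solution for comparison
```

With a fixed learning rate the sequential estimate fluctuates around the batch solution instead of converging to it exactly; a smaller `eta` reduces the fluctuation at the cost of slower convergence.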


4.1.6 Regularized least squares 


We have previously introduced the idea of adding a regularization term to an 
error function to control over-fitting, so that the total error function to be minimized 
takes the form 

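$$ E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \tag{4.23} $$

where $\lambda$ is the regularization coefficient that controls the relative importance of the
data-dependent error $E_D(\mathbf{w})$ and the regularization term $E_W(\mathbf{w})$. One of the simplest
forms of regularizer is given by the sum of the squares of the weight vector
elements:

$$ E_W(\mathbf{w}) = \frac{1}{2} \sum_j w_j^2 = \frac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}. \tag{4.24} $$

If we also consider the sum-of-squares error function given by

$$ E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \}^2, \tag{4.25} $$

then the total error function becomes

$$ \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}. \tag{4.26} $$

In statistics, this regularizer provides an example of a parameter shrinkage method
because it shrinks parameter values towards zero. It has the advantage that the error
function remains a quadratic function of $\mathbf{w}$, and so its exact minimizer can be found
in closed form. Specifically, setting the gradient of (4.26) with respect to $\mathbf{w}$ to zero
and solving for $\mathbf{w}$ as before, we obtain

$$ \mathbf{w} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathsf{t}. \tag{4.27} $$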


This represents a simple extension of the least-squares solution (4.14).

Figure 4.4 Representation of a linear regression model as a neural network having
a single layer of connections. Each basis function is represented by a node, with the
solid node representing the ‘bias’ basis function $\phi_0$. Likewise each output $y_1, \ldots, y_K$
is represented by a node. The links between the nodes represent the corresponding
weight and bias parameters.
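
The closed-form regularized solution (4.27) is straightforward to implement. The sketch below uses synthetic sinusoidal data and an assumed set of Gaussian basis functions; it checks that the gradient of the total error (4.26) vanishes at the solution and records how the norm of the weight vector shrinks as the regularization coefficient grows.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 25, 10
x = rng.uniform(0.0, 1.0, size=N)
mu = np.linspace(0.0, 1.0, M - 1)                  # assumed Gaussian centres
Phi = np.column_stack([np.ones(N), np.exp(-((x[:, None] - mu) ** 2) / (2 * 0.1**2))])
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, N)

def ridge_solution(Phi, t, lam):
    """Closed-form minimizer (4.27) of the regularized error (4.26)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

def total_error_gradient(w, Phi, t, lam):
    """Gradient of (4.26) with respect to w."""
    return -Phi.T @ (t - Phi @ w) + lam * w

lams = [1e-3, 1e-1, 10.0]
norms = [np.linalg.norm(ridge_solution(Phi, t, lam)) for lam in lams]
grad_norm = np.linalg.norm(total_error_gradient(ridge_solution(Phi, t, 0.1), Phi, t, 0.1))
```

Note that, as in (4.24), this sketch penalizes every element of the weight vector, including the bias.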


4.1.7 Multiple outputs 


So far, we have considered situations with a single target variable t. In some 
applications, we may wish to predict $K > 1$ target variables, which we denote collectively
by the target vector $\mathbf{t} = (t_1, \ldots, t_K)^{\mathrm{T}}$. This could be done by introducing
a different set of basis functions for each component of $\mathbf{t}$, leading to multiple, independent
regression problems. However, a more common approach is to use the same
set of basis functions to model all of the components of the target vector so that


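$$ \mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{4.28} $$

where $\mathbf{y}$ is a K-dimensional column vector, $\mathbf{W}$ is an $M \times K$ matrix of parameters,
and $\boldsymbol{\phi}(\mathbf{x})$ is an M-dimensional column vector with elements $\phi_j(\mathbf{x})$ with $\phi_0(\mathbf{x}) = 1$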
as before. Again, this can be represented as a neural network having a single layer 
of parameters, as shown in Figure 4.4. 

Suppose we take the conditional distribution of the target vector to be an isotropic 
Gaussian of the form 


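$$ p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \sigma^2) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \sigma^2 \mathbf{I}). \tag{4.29} $$

If we have a set of observations $\mathbf{t}_1, \ldots, \mathbf{t}_N$, we can combine these into a matrix $\mathbf{T}$
of size $N \times K$ such that the nth row is given by $\mathbf{t}_n^{\mathrm{T}}$. Similarly, we can combine the
input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into a matrix $\mathbf{X}$. The log likelihood function is then given
by

$$ \begin{aligned} \ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \sigma^2) &= \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2 \mathbf{I}) \\ &= -\frac{NK}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\|^2. \end{aligned} \tag{4.30} $$

As before, we can maximize this function with respect to $\mathbf{W}$, giving

$$ \mathbf{W}_{\mathrm{ML}} = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{T} \tag{4.31} $$

where we have combined the input feature vectors $\boldsymbol{\phi}(\mathbf{x}_1), \ldots, \boldsymbol{\phi}(\mathbf{x}_N)$ into a matrix
$\boldsymbol{\Phi}$. If we examine this result for each target variable $t_k$, we have

$$ \mathbf{w}_k = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathsf{t}_k = \boldsymbol{\Phi}^{\dagger} \mathsf{t}_k \tag{4.32} $$

where $\mathsf{t}_k$ is an N-dimensional column vector with components $t_{nk}$ for $n = 1, \ldots, N$.
Thus, the solution to the regression problem decouples between the different target
variables, and we need compute only a single pseudo-inverse matrix $\boldsymbol{\Phi}^{\dagger}$, which is
shared by all the vectors $\mathbf{w}_k$.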

The extension to general Gaussian noise distributions having arbitrary covari- 
ance matrices is straightforward. Again, this leads to a decoupling into K inde- 
pendent regression problems. This result is unsurprising because the parameters W 
define only the mean of the Gaussian noise distribution, and we know that the max- 
imum likelihood solution for the mean of a multivariate Gaussian is independent of 
the covariance. From now on, we will therefore consider a single target variable t 
for simplicity. 
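
The decoupling across target variables can be seen directly in code. In this sketch (synthetic data with K = 2 targets and an assumed quadratic basis) the pseudo-inverse is computed once and applied both to the full target matrix and to each column separately.

```python
import numpy as np

rng = np.random.default_rng(8)
N, M, K = 40, 3, 2
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x, x**2])                  # shared basis (assumed): 1, x, x^2
W_true = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 0.5]])      # M x K parameter matrix
T = Phi @ W_true + rng.normal(0.0, 0.1, size=(N, K))          # N x K target matrix

Phi_pinv = np.linalg.pinv(Phi)        # computed once, shared by all K targets
W_ml = Phi_pinv @ T                   # (4.31): all columns at once

# Column-by-column solution (4.32): each target is an independent regression.
W_columns = np.column_stack([Phi_pinv @ T[:, k] for k in range(K)])
```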


4.2. Decision theory


We have formulated the regression task as one of modelling a conditional proba- 
bility distribution p(t|x), and we have chosen a specific form for the conditional 
probability, namely a Gaussian (4.8) with an x-dependent mean y(x, w) governed 
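by parameters $\mathbf{w}$ and with variance given by the parameter $\sigma^2$. Both $\mathbf{w}$ and $\sigma^2$ can be
learned from data using maximum likelihood. The result is a predictive distribution
given by

$$ p(t \mid \mathbf{x}, \mathbf{w}_{\mathrm{ML}}, \sigma^2_{\mathrm{ML}}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}_{\mathrm{ML}}), \sigma^2_{\mathrm{ML}}). \tag{4.33} $$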


The predictive distribution expresses our uncertainty over the value of $t$ for some
new input x. However, for many practical applications we need to predict a specific 
value for t rather than returning an entire distribution, particularly where we must 
take a specific action. For example, if our goal is to determine the optimal level of 
radiation to use for treating a tumour and our model predicts a probability distri- 
bution over radiation dose, then we must use that distribution to decide the specific 
dose to be administered. Our task therefore breaks down into two stages. In the first 
stage, called the inference stage, we use the training data to determine a predictive 
distribution p(t|x). In the second stage, known as the decision stage, we use this 
predictive distribution to determine a specific value f(x), which will be dependent 
on the input vector x, that is optimal according to some criterion. We can do this 
by minimizing a loss function that depends on both the predictive distribution p(t|x) 
and on f. 

Intuitively we might choose the mean of the conditional distribution, so that 
we would use $f(\mathbf{x}) = y(\mathbf{x}, \mathbf{w}_{\mathrm{ML}})$. In some cases this intuition will be correct, but
in other situations it can give very poor results. It is therefore useful to formalize 
this so that we can understand when it applies and under what assumptions, and the 
framework for doing this is called decision theory. 

Suppose that we choose a value f(x) for our prediction when the true value is 
t. In doing so, we incur some form of penalty or cost. This is determined by a 
loss, which we denote L(t, f(x)). Of course, we do not know the true value of t, so 
instead of minimizing $L$ itself, we minimize the average, or expected, loss which is
given by

$$ \mathbb{E}[L] = \iint L(t, f(\mathbf{x})) \, p(\mathbf{x}, t) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t \tag{4.34} $$

where we are averaging over the distribution of both input and target variables,
weighted by their joint distribution $p(\mathbf{x}, t)$. A common choice of loss function in
regression problems is the squared loss given by $L(t, f(\mathbf{x})) = \{ f(\mathbf{x}) - t \}^2$. In this
case, the expected loss can be written

$$ \mathbb{E}[L] = \iint \{ f(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t. \tag{4.35} $$

Figure 4.5 The regression function $f^\star(\mathbf{x})$, which minimizes the expected squared loss,
is given by the mean of the conditional distribution $p(t \mid \mathbf{x})$.



It is important not to confuse the squared-loss function with the sum-of-squares 
error function introduced earlier. The error function is used to set the parameters 
during training in order to determine the conditional probability distribution p(t|x), 
whereas the loss function governs how the conditional distribution is used to arrive 
at a predictive function f(x) specifying a prediction for each value of x. 

Our goal is to choose f(x) so as to minimize E[L]. If we assume a completely 
flexible function f(x), we can do this formally using the calculus of variations to 
give 


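$$ \frac{\delta \mathbb{E}[L]}{\delta f(\mathbf{x})} = 2 \int \{ f(\mathbf{x}) - t \} \, p(\mathbf{x}, t) \, \mathrm{d}t = 0. \tag{4.36} $$

Solving for $f(\mathbf{x})$ and using the sum and product rules of probability, we obtain

$$ f^\star(\mathbf{x}) = \frac{1}{p(\mathbf{x})} \int t \, p(\mathbf{x}, t) \, \mathrm{d}t = \int t \, p(t \mid \mathbf{x}) \, \mathrm{d}t = \mathbb{E}_t[t \mid \mathbf{x}], \tag{4.37} $$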


which is the conditional average of t conditioned on x and is known as the regression 
function. This result is illustrated in Figure 4.5. It can readily be extended to multiple 
target variables represented by the vector t, in which case the optimal solution is the 
conditional average $f^\star(\mathbf{x}) = \mathbb{E}_{\mathbf{t}}[\mathbf{t} \mid \mathbf{x}]$. For a Gaussian conditional distribution of the
form (4.8), the conditional mean will be simply

$$ \mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, \mathrm{d}t = y(\mathbf{x}, \mathbf{w}). \tag{4.38} $$


The use of calculus of variations to derive (4.37) implies that we are optimiz- 
ing over all possible functions f(x). Although any parametric model that we can 
implement in practice is limited in the range of functions that it can represent, the 
framework of deep neural networks, discussed extensively in later chapters, provides 
a highly flexible class of functions that, for many practical purposes, can approxi- 
mate any desired function to high accuracy. 

We can derive this result in a slightly different way, which will also shed light 
on the nature of the regression problem. Armed with the knowledge that the optimal 
solution is the conditional expectation, we can expand the square term as follows 


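$$ \begin{aligned} \{ f(\mathbf{x}) - t \}^2 &= \{ f(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] + \mathbb{E}[t \mid \mathbf{x}] - t \}^2 \\ &= \{ f(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \}^2 + 2 \{ f(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \{ \mathbb{E}[t \mid \mathbf{x}] - t \} + \{ \mathbb{E}[t \mid \mathbf{x}] - t \}^2 \end{aligned} $$

where, to keep the notation uncluttered, we use $\mathbb{E}[t \mid \mathbf{x}]$ to denote $\mathbb{E}_t[t \mid \mathbf{x}]$. Substituting
into the loss function (4.35) and performing the integral over $t$, we see that the cross-term
vanishes and we obtain an expression for the loss function in the form

$$ \mathbb{E}[L] = \int \{ f(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \}^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x} + \int \mathrm{var}[t \mid \mathbf{x}] \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x}. \tag{4.39} $$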


The function f(x) we seek to determine appears only in the first term, which will be 
minimized when $f(\mathbf{x})$ is equal to $\mathbb{E}[t \mid \mathbf{x}]$, in which case this term will vanish. This is
simply the result that we derived previously, and shows that the optimal least-squares 
predictor is given by the conditional mean. The second term is the variance of the 
distribution of t, averaged over x, and represents the intrinsic variability of the target 
data and can be regarded as noise. Because it is independent of f(x), it represents 
the irreducible minimum value of the loss function. 

The squared loss is not the only possible choice of loss function for regression. 
Here we consider briefly one simple generalization of the squared loss, called the 
Minkowski loss, whose expectation is given by 


$$ \mathbb{E}[L_q] = \iint |f(\mathbf{x}) - t|^q \, p(\mathbf{x}, t) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t, \tag{4.40} $$

which reduces to the expected squared loss for $q = 2$. The function $|f - t|^q$ is
plotted against $f - t$ for various values of $q$ in Figure 4.6. The minimum of $\mathbb{E}[L_q]$ is
given by the conditional mean for $q = 2$, the conditional median for $q = 1$, and the
conditional mode for $q \to 0$.
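
The dependence of the optimal prediction on $q$ can be illustrated by a simple Monte Carlo experiment. The sketch below assumes a skewed (exponential) distribution for $t$ at a single fixed input, so that the mean and the median differ, and searches a grid of candidate predictions for the minimizer of the empirical Minkowski loss. The limit $q \to 0$ is not covered by this sketch.

```python
import numpy as np

rng = np.random.default_rng(9)
# Samples of t at a fixed x from an assumed exponential distribution: mean 1, median ln 2.
t = rng.exponential(scale=1.0, size=20000)

def expected_minkowski_loss(f, q):
    """Monte Carlo estimate of E[|f - t|^q] for a scalar prediction f."""
    return np.mean(np.abs(f - t) ** q)

candidates = np.linspace(0.0, 3.0, 301)     # grid of candidate predictions f
best = {q: candidates[np.argmin([expected_minkowski_loss(f, q) for f in candidates])]
        for q in (1, 2)}
```

Here `best[2]` should lie close to the sample mean `t.mean()` and `best[1]` close to the sample median `np.median(t)`, up to the resolution of the grid.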

Note that the Gaussian noise assumption implies that the conditional distribution
of $t$ given $\mathbf{x}$ is unimodal, which may be inappropriate for some applications. In
this case a squared loss can lead to very poor results and we need to develop more
sophisticated approaches. For example, we can extend this model by using mixtures
of Gaussians to give multimodal conditional distributions (Section 6.5), which often arise in the
solution of inverse problems. Our focus in this section has been on decision theory
for regression problems, and in the next chapter we shall develop analogous concepts
for classification tasks (Section 5.2).

Figure 4.6 Plots of the quantity $L_q = |f - t|^q$ for various values of $q$.


4.3. The Bias-Variance Trade-off


So far in our discussion of linear models for regression, we have assumed that the
form and number of basis functions are both given. We have also seen (Section 1.2) that the use
of maximum likelihood can lead to severe over-fitting if complex models are trained
using data sets of limited size. However, limiting the number of basis functions
to avoid over-fitting has the side effect of limiting the flexibility of the model to
capture interesting and important trends in the data. Although a regularization term
can control over-fitting for models with many parameters, this raises the question of
how to determine a suitable value for the regularization coefficient $\lambda$. Seeking the
solution that minimizes the regularized error function with respect to both the weight
vector $\mathbf{w}$ and the regularization coefficient $\lambda$ is clearly not the right approach, since
this leads to the unregularized solution with $\lambda = 0$.

It is instructive to consider a frequentist viewpoint of the model complexity issue,
known as the bias-variance trade-off. Although we will introduce this concept
in the context of linear basis function models, where it is easy to illustrate the ideas 
using simple examples, the discussion has very general applicability. Note, however, 
that over-fitting is really an unfortunate property of maximum likelihood and does 
not arise when we marginalize over parameters in a Bayesian setting (Bishop, 2006). 

When we discussed decision theory for regression problems, we considered var- 
ious loss functions, each of which leads to a corresponding optimal prediction once 
we are given the conditional distribution p(t|x). A popular choice is the squared-loss 
function, for which the optimal prediction is given by the conditional expectation, 
which we denote by h(x) and is given by 


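$$ h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, \mathrm{d}t. \tag{4.41} $$

We have also seen that the expected squared loss can be written in the form

$$ \mathbb{E}[L] = \int \{ f(\mathbf{x}) - h(\mathbf{x}) \}^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x} + \iint \{ h(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t. \tag{4.42} $$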


Recall that the second term, which is independent of f(x), arises from the intrin- 
sic noise on the data and represents the minimum achievable value of the expected 
loss. The first term depends on our choice for the function f(x), and we will seek a 
solution for f(x) that makes this term a minimum. Because it is non-negative, the 
smallest value that we can hope to achieve for this term is zero. If we had an unlim- 
ited supply of data (and unlimited computational resources), we could in principle 
find the regression function h(x) to any desired degree of accuracy, and this would 
represent the optimal choice for f(x). However, in practice we have a data set D 
containing only a finite number N of data points, and consequently, we cannot know 
the regression function h(x) exactly. 

If we were to model h(x) using a function governed by a parameter vector w, 
then from a Bayesian perspective, the uncertainty in our model would be expressed 
through a posterior distribution over w. A frequentist treatment, however, involves 
making a point estimate of w based on the data set D and tries instead to interpret the 
uncertainty of this estimate through the following thought experiment. Suppose we 
had a large number of data sets each of size N and each drawn independently from 
the distribution p(t, x). For any given data set D, we can run our learning algorithm 
and obtain a prediction function f(x; D). Different data sets from the ensemble will 
give different functions and consequently different values of the squared loss. The 
performance of a particular learning algorithm is then assessed by taking the average 
over this ensemble of data sets. 

Consider the integrand of the first term in (4.42), which for a particular data set 
D takes the form 

$$ \{ f(\mathbf{x}; \mathcal{D}) - h(\mathbf{x}) \}^2. \tag{4.43} $$

Because this quantity will be dependent on the particular data set $\mathcal{D}$, we take its average
over the ensemble of data sets. If we add and subtract the quantity $\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]$
inside the braces, and then expand, we obtain


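$$ \begin{aligned} & \{ f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] + \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}^2 \\ &\quad = \{ f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] \}^2 + \{ \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}^2 \\ &\qquad + 2 \{ f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] \} \{ \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}. \end{aligned} \tag{4.44} $$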


We now take the expectation of this expression with respect to D and note that the 
final term will vanish, giving 


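$$ \mathbb{E}_{\mathcal{D}}\left[ \{ f(\mathbf{x}; \mathcal{D}) - h(\mathbf{x}) \}^2 \right] = \underbrace{ \{ \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}^2 }_{(\text{bias})^2} + \underbrace{ \mathbb{E}_{\mathcal{D}}\left[ \{ f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] \}^2 \right] }_{\text{variance}}. \tag{4.45} $$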


We see that the expected squared difference between f(x;D) and the regression 
function h(x) can be expressed as the sum of two terms. The first term, called the 
squared bias, represents the extent to which the average prediction over all data sets 
differs from the desired regression function. The second term, called the variance, 
measures the extent to which the solutions for individual data sets vary around their 
average, and hence, this measures the extent to which the function f(x;D) is sen- 
sitive to the particular choice of data set. We will provide some intuition to support 
these definitions shortly when we consider a simple example. 

So far, we have considered a single input value x. If we substitute this expansion 
back into (4.42), we obtain the following decomposition of the expected squared 
loss: 


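$$ \text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise} \tag{4.46} $$

where

$$ (\text{bias})^2 = \int \{ \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x} \tag{4.47} $$

$$ \text{variance} = \int \mathbb{E}_{\mathcal{D}}\left[ \{ f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] \}^2 \right] p(\mathbf{x}) \, \mathrm{d}\mathbf{x} \tag{4.48} $$

$$ \text{noise} = \iint \{ h(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t \tag{4.49} $$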


and the bias and variance terms now refer to integrated quantities. 

Our goal is to minimize the expected loss, which we have decomposed into the 
sum of a (squared) bias, a variance, and a constant noise term. As we will see, there is 
a trade-off between bias and variance, with very flexible models having low bias and 
high variance, and relatively rigid models having high bias and low variance. The 
model with the optimal predictive capability is the one that leads to the best balance 
between bias and variance. This is illustrated by considering the sinusoidal data set 
introduced earlier. Here we independently generate 100 data sets, each containing
$N = 25$ data points, from the sinusoidal curve $h(x) = \sin(2\pi x)$. The data sets are
indexed by $l = 1, \ldots, L$, where $L = 100$. For each data set $\mathcal{D}^{(l)}$, we fit a model
with 24 Gaussian basis functions along with a constant ‘bias’ basis function to
give a total of $M = 25$ parameters. By minimizing the regularized error function (4.26),
we obtain a prediction function $f^{(l)}(x)$, as shown in Figure 4.7.

The top row corresponds to a large value of the regularization coefficient that 
gives low variance (because the red curves in the left plot look similar) but high 
bias (because the two curves in the right plot are very different). Conversely on 
the bottom row, for which $\lambda$ is small, there is large variance (shown by the high
variability between the red curves in the left plot) but low bias (shown by the good 
fit between the average model fit and the original sinusoidal function). Note that 
the result of averaging many solutions for the complex model with M = 25 is a 
very good fit to the regression function, which suggests that averaging may be a 
beneficial procedure. Indeed, a weighted averaging of multiple solutions lies at the 
heart of a Bayesian approach, although the averaging is with respect to the posterior 
distribution of parameters, not with respect to multiple data sets. 

We can also examine the bias—variance trade-off quantitatively for this example. 
The average prediction is estimated from 


1 L 
= =_ 24 fO (z (4.50) 


and the integrated squared bias and integrated variance are then given by 


N 
(bias)? = w D P) - h(n) ¥ (4.51) 
mesis isiyo Tey 4.52 
variance = NISU (tn) — f(an)} (4.52) 


where the integral over x, weighted by the distribution p(x), is approximated by a 
finite sum over data points drawn from that distribution. These quantities, along with 
their sum, are plotted as a function of ln in Figure 4.8. We see that small values 
of À allow the model to become finely tuned to the noise on each individual data set 
leading to large variance. Conversely, a large value of A pulls the weight parameters 
towards zero leading to large bias. 
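
The quantities in (4.50) to (4.52) can be estimated by direct simulation. The following is a minimal sketch rather than the procedure used to produce the figures: the basis-function width `s`, the noise level `noise_std`, the uniform sampling of inputs, and the use of an evaluation grid in place of the data points $x_n$ are all assumptions made for illustration, and the regularized solution is assumed to take the form $\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$.

```python
import numpy as np

def gaussian_design(x, centres, s):
    """Design matrix: a constant bias column followed by Gaussian basis functions."""
    phi = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)
    return np.hstack([np.ones((x.size, 1)), phi])

def bias_variance(lam, L=100, N=25, n_basis=24, s=0.1, noise_std=0.3, seed=0):
    """Estimate (bias)^2 and variance for one value of the regularization coefficient."""
    rng = np.random.default_rng(seed)
    centres = np.linspace(0.0, 1.0, n_basis)      # assumed placement of the basis functions
    x_eval = np.linspace(0.0, 1.0, 200)           # grid standing in for the points x_n
    h = np.sin(2 * np.pi * x_eval)                # the true function h(x)
    phi_eval = gaussian_design(x_eval, centres, s)
    fits = np.empty((L, x_eval.size))
    for l in range(L):
        # Draw one data set and fit it by regularized least squares.
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, N)
        phi = gaussian_design(x, centres, s)
        w = np.linalg.solve(lam * np.eye(phi.shape[1]) + phi.T @ phi, phi.T @ t)
        fits[l] = phi_eval @ w
    f_bar = fits.mean(axis=0)                     # average prediction, as in (4.50)
    bias_sq = np.mean((f_bar - h) ** 2)           # squared bias, as in (4.51)
    variance = np.mean((fits - f_bar) ** 2)       # variance, as in (4.52)
    return bias_sq, variance

bias_hi, var_hi = bias_variance(lam=np.exp(4.0))   # heavy regularization
bias_lo, var_lo = bias_variance(lam=np.exp(-4.0))  # light regularization
```

Under these assumptions, the heavily regularized model should show the larger squared bias and the lightly regularized model the larger variance, in line with the qualitative behaviour described above.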

Note that the bias-variance decomposition is of limited practical value because
it is based on averages with respect to ensembles of data sets, whereas in practice 
we have only the single observed data set. If we had a large number of independent 
training sets of a given size, we would be better off combining them into a single 
larger training set, which of course would reduce the level of over-fitting for a given 
model complexity. Nevertheless, the bias-variance decomposition often provides
useful insights into the model complexity issue, and although we have introduced it 
in this chapter from the perspective of regression problems, the underlying intuition 
has broad applicability. 
Figure 4.7 Illustration of the dependence of bias and variance on model complexity governed by a regularization
parameter $\lambda$, using the sinusoidal data from Chapter 1. There are L = 100 data sets, each having N = 25
data points, and there are 24 Gaussian basis functions in the model so that the total number of parameters is
M = 25 including the bias parameter. The left column shows the result of fitting the model to the data sets for
various values of $\ln\lambda$ (for clarity, only 20 of the 100 fits are shown). The right column shows the corresponding
average of the 100 fits (red) along with the sinusoidal function from which the data sets were generated (green).




Figure 4.8 Plot of squared bias and variance, together with their sum, corresponding to the results shown in
Figure 4.7. Also shown is the average test set error for a test data set size of 1,000 points. The minimum
value of (bias)$^2$ + variance occurs around $\ln\lambda = 0.43$, which is close to the value that gives the minimum
error on the test data.
Exercises


4.1 (⋆) Consider the sum-of-squares error function given by (1.2) in which the function
y(x, w) is given by the polynomial (1.1). Show that the coefficients $\mathbf{w} = \{w_i\}$ that
minimize this error function are given by the solution to the following set of linear
equations:

$$\sum_{j=0}^{M} A_{ij} w_j = T_i$$ (4.53)

where

$$A_{ij} = \sum_{n=1}^{N} (x_n)^{i+j}, \qquad T_i = \sum_{n=1}^{N} (x_n)^{i} t_n.$$ (4.54)

Here a suffix i or j denotes the index of a component, whereas $(x)^i$ denotes x raised
to the power of i.


4.2 (⋆) Write down the set of coupled linear equations, analogous to (4.53), satisfied by
the coefficients $w_i$ that minimize the regularized sum-of-squares error function given
by (1.4).


4.3 (⋆) Show that the tanh function defined by

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$$ (4.55)

and the logistic sigmoid function defined by (4.6) are related by

$$\tanh(a) = 2\sigma(2a) - 1.$$ (4.56)

Hence, show that a general linear combination of logistic sigmoid functions of the
form

$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j\,\sigma\!\left(\frac{x - \mu_j}{s}\right)$$ (4.57)

is equivalent to a linear combination of tanh functions of the form

$$y(x, \mathbf{u}) = u_0 + \sum_{j=1}^{M} u_j \tanh\!\left(\frac{x - \mu_j}{2s}\right)$$ (4.58)

and find expressions to relate the new parameters $\{u_1, \ldots, u_M\}$ to the original
parameters $\{w_1, \ldots, w_M\}$.


4.4 (⋆⋆⋆) Show that the matrix

$$\boldsymbol{\Phi}(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}$$ (4.59)

takes any vector v and projects it onto the space spanned by the columns of $\boldsymbol{\Phi}$. Use
this result to show that the least-squares solution (4.14) corresponds to an orthogonal
projection of the vector t onto the manifold S, as shown in Figure 4.3.


4.5 (⋆) Consider a data set in which each data point $t_n$ is associated with a weighting
factor $r_n > 0$, so that the sum-of-squares error function becomes

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} r_n \left\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2.$$ (4.60)

Find an expression for the solution $\mathbf{w}^{\star}$ that minimizes this error function. Give two
alternative interpretations of the weighted sum-of-squares error function in terms of
(i) data-dependent noise variance and (ii) replicated data points.


4.6 (⋆) By setting the gradient of (4.26) with respect to w to zero, show that the exact
minimum of the regularized sum-of-squares error function for linear regression is
given by (4.27).


4.7 (⋆⋆) Consider a linear basis function regression model for a multivariate target
variable t having a Gaussian distribution of the form

$$p(\mathbf{t}|\mathbf{W}, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{t}|\mathbf{y}(\mathbf{x}, \mathbf{W}), \boldsymbol{\Sigma})$$ (4.61)

where

$$\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$$ (4.62)

together with a training data set comprising input basis vectors $\boldsymbol{\phi}(\mathbf{x}_n)$ and
corresponding target vectors $\mathbf{t}_n$, with n = 1,..., N. Show that the maximum likelihood
solution $\mathbf{W}_{\mathrm{ML}}$ for the parameter matrix W has the property that each column is
given by an expression of the form (4.14), which was the solution for an isotropic
noise distribution. Note that this is independent of the covariance matrix $\boldsymbol{\Sigma}$. Show
that the maximum likelihood solution for $\boldsymbol{\Sigma}$ is given by

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right)\left(\mathbf{t}_n - \mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right)^{\mathrm{T}}.$$ (4.63)
4.8 (⋆) Consider the generalization of the squared-loss function (4.35) for a single target
variable t to multiple target variables described by the vector t given by

$$\mathbb{E}[L(\mathbf{t}, \mathbf{f}(\mathbf{x}))] = \iint \|\mathbf{f}(\mathbf{x}) - \mathbf{t}\|^2\, p(\mathbf{x}, \mathbf{t})\,\mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{t}.$$ (4.64)

Using the calculus of variations, show that the function f(x) for which this expected
loss is minimized is given by

$$\mathbf{f}(\mathbf{x}) = \mathbb{E}_{\mathbf{t}}[\mathbf{t}|\mathbf{x}].$$ (4.65)


4.9 (⋆) By expansion of the square in (4.64), derive a result analogous to (4.39) and,
hence, show that the function f(x) that minimizes the expected squared loss for a
vector t of target variables is again given by the conditional expectation of t in the
form (4.65).


4.10 (⋆⋆) Rederive the result (4.65) by first expanding (4.64) analogous to (4.39).


4.11 (⋆⋆) The following distribution

$$p(x|\sigma^2, q) = \frac{q}{2(2\sigma^2)^{1/q}\,\Gamma(1/q)} \exp\!\left(-\frac{|x|^q}{2\sigma^2}\right)$$ (4.66)

is a generalization of the univariate Gaussian distribution. Here $\Gamma(x)$ is the gamma
function defined by

$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,\mathrm{d}u.$$ (4.67)

Show that this distribution is normalized so that

$$\int_{-\infty}^{\infty} p(x|\sigma^2, q)\,\mathrm{d}x = 1$$ (4.68)

and that it reduces to the Gaussian when q = 2. Consider a regression model in
which the target variable is given by $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$ and $\epsilon$ is a random noise variable
drawn from the distribution (4.66). Show that the log likelihood function over w and
$\sigma^2$, for an observed data set of input vectors $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and corresponding
target variables $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$, is given by

$$\ln p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N} |y(\mathbf{x}_n, \mathbf{w}) - t_n|^q - \frac{N}{q}\ln(2\sigma^2) + \text{const}$$ (4.69)

where ‘const’ denotes terms independent of both w and $\sigma^2$. Note that, as a function
of w, this is the $L_q$ error function considered in Section 4.2.


4.12 (⋆⋆) Consider the expected loss for regression problems under the $L_q$ loss function
given by (4.40). Write down the condition that y(x) must satisfy to minimize $\mathbb{E}[L_q]$.
Show that, for q = 1, this solution represents the conditional median, i.e., the function
y(x) such that the probability mass for t < y(x) is the same as for t > y(x).
Also show that the minimum expected $L_q$ loss for $q \to 0$ is given by the conditional
mode, i.e., by the function y(x) being equal to the value of t that maximizes p(t|x)
for each x.




Single-layer Networks: Classification


In the previous chapter, we explored a class of regression models in which the out- 
put variables were linear functions of the model parameters and which can therefore 
be expressed as simple neural networks having a single layer of weight and bias 
parameters. We turn now to a discussion of classification problems, and in this chap- 
ter, we will focus on an analogous class of models that again can be expressed as 
single-layer neural networks. These will allow us to introduce many of the key concepts
of classification before dealing with more general deep neural networks in later chapters.

The goal in classification is to take an input vector $\mathbf{x} \in \mathbb{R}^D$ and assign it to one
of K discrete classes $\mathcal{C}_k$ where k = 1,..., K. In the most common scenario, the
classes are taken to be disjoint, so that each input is assigned to one and only one
class. The input space is thereby divided into decision regions whose boundaries are
called decision boundaries or decision surfaces. In this chapter, we consider linear 
models for classification, by which we mean that the decision surfaces are linear
functions of the input vector x and, hence, are defined by (D − 1)-dimensional
hyperplanes within the D-dimensional input space. Data sets whose classes can
be separated exactly by linear decision surfaces are said to be linearly separable.
Linear classification models can be applied to data sets that are not linearly separable,
although not all inputs will be correctly classified.

We can broadly identify three distinct approaches to solving classification problems.
The simplest involves constructing a discriminant function that directly assigns
each vector x to a specific class. A more powerful approach, however, models the
conditional probability distributions $p(\mathcal{C}_k|\mathbf{x})$ in an inference stage and subsequently
uses these distributions to make optimal decisions. Separating inference and decision
brings numerous benefits (Section 5.2.4). There are two different approaches to determining
the conditional probabilities $p(\mathcal{C}_k|\mathbf{x})$. One technique is to model them directly, for
example by representing them as parametric models and then optimizing the parameters
using a training set. This will be called a discriminative probabilistic model.
Alternatively, we can model the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, together with
the prior probabilities $p(\mathcal{C}_k)$ for the classes, and then compute the required posterior
probabilities using Bayes’ theorem:

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(\mathbf{x})}.$$ (5.1)

This will be called a generative probabilistic model because it offers the opportunity
to generate samples from each of the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$. In this
chapter, we will discuss examples of all three approaches: discriminant functions,
generative probabilistic models, and discriminative probabilistic models.


5.1. Discriminant Functions


A discriminant is a function that takes an input vector x and assigns it to one of K 
classes, denoted $\mathcal{C}_k$. In this chapter, we will restrict attention to linear discriminants,
namely those for which the decision surfaces are hyperplanes. To simplify the dis- 
cussion, we consider first two classes and then investigate the extension to K > 2 
classes. 


5.1.1 Two classes 


The simplest representation of a linear discriminant function is obtained by tak- 
ing a linear function of the input vector so that 


$$y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0$$ (5.2)

where w is called a weight vector, and $w_0$ is a bias (not to be confused with bias in
the statistical sense). An input vector x is assigned to class $\mathcal{C}_1$ if y(x) > 0 and to
class $\mathcal{C}_2$ otherwise. The corresponding decision boundary is therefore defined by the
relation y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within
the D-dimensional input space. Consider two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both of which lie on
the decision surface. Because $y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0$, we have $\mathbf{w}^{\mathrm{T}}(\mathbf{x}_A - \mathbf{x}_B) = 0$ and
hence the vector w is orthogonal to every vector lying within the decision surface,
and so w determines the orientation of the decision surface. Similarly, if x is a point
on the decision surface, then y(x) = 0, and so the normal distance from the origin
to the decision surface is given by

$$\frac{\mathbf{w}^{\mathrm{T}}\mathbf{x}}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}.$$ (5.3)

We therefore see that the bias parameter $w_0$ determines the location of the decision
surface. These properties are illustrated for the case of D = 2 in Figure 5.1.

Figure 5.1 Illustration of the geometry of a linear discriminant function in two dimensions.
The decision surface, shown in red, is perpendicular to w, and its displacement from the
origin is controlled by the bias parameter $w_0$. Also, the signed orthogonal distance of a
general point x from the decision surface is given by $y(\mathbf{x})/\|\mathbf{w}\|$.

Furthermore, note that the value of y(x) gives a signed measure of the perpendicular
distance r of the point x from the decision surface. To see this, consider an
arbitrary point x and let $\mathbf{x}_\perp$ be its orthogonal projection onto the decision surface,
so that

$$\mathbf{x} = \mathbf{x}_\perp + r\frac{\mathbf{w}}{\|\mathbf{w}\|}.$$ (5.4)

Multiplying both sides of this result by $\mathbf{w}^{\mathrm{T}}$ and adding $w_0$, and making use of
$y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0$ and $y(\mathbf{x}_\perp) = \mathbf{w}^{\mathrm{T}}\mathbf{x}_\perp + w_0 = 0$, we have

$$r = \frac{y(\mathbf{x})}{\|\mathbf{w}\|}.$$ (5.5)

This result is illustrated in Figure 5.1.

As with linear regression models (Section 4.1.1), it is sometimes convenient to use a more
compact notation in which we introduce an additional dummy ‘input’ value $x_0 = 1$ and
then define $\tilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\tilde{\mathbf{x}} = (x_0, \mathbf{x})$ so that

$$y(\mathbf{x}) = \tilde{\mathbf{w}}^{\mathrm{T}}\tilde{\mathbf{x}}.$$ (5.6)

In this case, the decision surfaces are D-dimensional hyperplanes passing through
the origin of the (D + 1)-dimensional expanded input space.

Figure 5.2 Attempting to construct a K-class discriminant from a set of two-class discriminant functions leads
to ambiguous regions, as shown in green. On the left is an example with two discriminant functions designed to
distinguish points in class $\mathcal{C}_k$ from points not in class $\mathcal{C}_k$. On the right is an example involving three discriminant
functions each of which is used to separate a pair of classes $\mathcal{C}_k$ and $\mathcal{C}_j$.
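
The geometry of a two-class linear discriminant can be checked numerically. The short sketch below uses an arbitrary, purely illustrative weight vector and bias (not values from the text) to compute the signed distance in (5.5), recover the projection onto the decision surface by rearranging (5.4), and confirm that the augmented notation of (5.6) gives the same value of y(x).

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = y(x) / ||w|| from the surface y(x) = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w = np.array([3.0, 4.0])     # illustrative weight vector with ||w|| = 5
w0 = -5.0                    # illustrative bias
x = np.array([3.0, 4.0])     # an arbitrary point

r = signed_distance(x, w, w0)
x_perp = x - r * w / np.linalg.norm(w)    # orthogonal projection onto the decision surface

# Augmented notation: prepend w0 to w and a dummy input 1 to x.
w_tilde = np.concatenate([[w0], w])
x_tilde = np.concatenate([[1.0], x])
y_aug = w_tilde @ x_tilde                 # equals w @ x + w0
```

Here y(x) = 20 and the norm of w is 5, so r = 4, and the projected point satisfies y = 0 as expected.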


5.1.2 Multiple classes 


Now consider the extension of linear discriminant functions to K > 2 classes. 
We might be tempted to build a K-class discriminant by combining a number of 
two-class discriminant functions. However, this leads to some serious difficulties 
(Duda and Hart, 1973), as we now show. 

Consider a model with K − 1 classifiers, each of which solves a two-class problem
of separating points in a particular class $\mathcal{C}_k$ from points not in that class. This
is known as a one-versus-the-rest classifier. The left-hand diagram in Figure 5.2
shows an example involving three classes where this approach leads to regions of
input space that are ambiguously classified.

An alternative is to introduce K(K − 1)/2 binary discriminant functions, one
for every possible pair of classes. This is known as a one-versus-one classifier. Each 
point is then classified according to a majority vote amongst the discriminant func- 
tions. However, this too runs into the problem of ambiguous regions, as illustrated 
in the right-hand diagram of Figure 5.2. 

We can avoid these difficulties by considering a single K-class discriminant 
comprising K linear functions of the form 


$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}$$ (5.7)

and then assigning a point x to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$. The decision
boundary between class $\mathcal{C}_k$ and class $\mathcal{C}_j$ is therefore given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$ and
hence corresponds to a (D − 1)-dimensional hyperplane defined by

$$(\mathbf{w}_k - \mathbf{w}_j)^{\mathrm{T}}\mathbf{x} + (w_{k0} - w_{j0}) = 0.$$ (5.8)

This has the same form as the decision boundary for the two-class case discussed in
Section 5.1.1, and so analogous geometrical properties apply.

The decision regions of such a discriminant are always singly connected and
convex. To see this, consider two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both of which lie inside decision
region $\mathcal{R}_k$, as illustrated in Figure 5.3. Any point $\hat{\mathbf{x}}$ that lies on the line connecting
$\mathbf{x}_A$ and $\mathbf{x}_B$ can be expressed in the form

$$\hat{\mathbf{x}} = \lambda\mathbf{x}_A + (1 - \lambda)\mathbf{x}_B$$ (5.9)

where $0 < \lambda < 1$. From the linearity of the discriminant functions, it follows that

$$y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1 - \lambda) y_k(\mathbf{x}_B).$$ (5.10)

Because both $\mathbf{x}_A$ and $\mathbf{x}_B$ lie inside $\mathcal{R}_k$, it follows that $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and that
$y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$, for all $j \neq k$, and hence $y_k(\hat{\mathbf{x}}) > y_j(\hat{\mathbf{x}})$, and so $\hat{\mathbf{x}}$ also lies inside
$\mathcal{R}_k$. Thus, $\mathcal{R}_k$ is singly connected and convex.

Figure 5.3 Illustration of the decision regions for a multi-class linear discriminant, with the
decision boundaries shown in red. If two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both lie inside the same decision
region $\mathcal{R}_k$, then any point $\hat{\mathbf{x}}$ that lies on the line connecting these two points must also lie
in $\mathcal{R}_k$, and hence, the decision region must be singly connected and convex.
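
The convexity argument can also be illustrated empirically. The sketch below is only a numerical sanity check under assumed values: the K = 4 weight vectors and biases are drawn at random, pairs of points are sampled from an arbitrary square, and for every pair that shares a decision region we test a random point on the connecting line segment.

```python
import numpy as np

def classify(X, W, w0):
    """Assign each row of X to the class k with the largest y_k(x) = w_k^T x + w_k0."""
    return np.argmax(X @ W.T + w0, axis=1)

rng = np.random.default_rng(0)
K, D = 4, 2
W = rng.normal(size=(K, D))       # row k holds w_k (random, purely illustrative)
w0 = rng.normal(size=K)           # biases w_k0

# Sample pairs of points and a random mixing coefficient for each pair.
xa = rng.uniform(-3.0, 3.0, size=(2000, D))
xb = rng.uniform(-3.0, 3.0, size=(2000, D))
lam = rng.uniform(0.0, 1.0, size=(2000, 1))
x_hat = lam * xa + (1.0 - lam) * xb            # point on the connecting segment

ka, kb, k_hat = classify(xa, W, w0), classify(xb, W, w0), classify(x_hat, W, w0)
same = ka == kb                                # pairs lying in the same decision region
convex_ok = bool(np.all(k_hat[same] == ka[same]))
```

For every pair in the same region, the intermediate point should receive the same class label, so `convex_ok` is expected to be true for any choice of linear discriminants.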

Note that for two classes, we can either employ the formalism discussed here, 
based on two discriminant functions $y_1(\mathbf{x})$ and $y_2(\mathbf{x})$, or else use the simpler but
essentially equivalent formulation based on a single discriminant function y(x). 


5.1.3 1-of-K coding


For regression problems, the target variable t was simply the vector of real num- 
bers whose values we wish to predict. In classification, there are various ways of 
using target values to represent class labels. For two-class problems, the most convenient
is the binary representation in which there is a single target variable $t \in \{0, 1\}$
such that t = 1 represents class $\mathcal{C}_1$ and t = 0 represents class $\mathcal{C}_2$. We can interpret
the value of t as the probability that the class is $\mathcal{C}_1$, with the probability values taking
only the extreme values of 0 and 1. For K > 2 classes, it is convenient to use a
1-of-K coding scheme, also known as the one-hot encoding scheme, in which t is
a vector of length K such that if the class is $\mathcal{C}_j$, then all elements $t_k$ of t are zero
except element $t_j$, which takes the value 1. For instance, if we have K = 5 classes,
then a data point from class 2 would be given the target vector

$$\mathbf{t} = (0, 1, 0, 0, 0)^{\mathrm{T}}.$$ (5.11)

Again, we can interpret the value of $t_k$ as the probability that the class is $\mathcal{C}_k$, in which
the probabilities take only the values 0 and 1.
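
As a concrete illustration of this coding scheme, the following sketch builds 1-of-K target vectors from integer class labels. The helper name `one_hot` and the convention that classes are numbered from 1 to K are assumptions made here for illustration.

```python
import numpy as np

def one_hot(labels, K):
    """Return an N x K matrix whose nth row is the 1-of-K target vector for labels[n].

    Classes are assumed to be numbered 1, ..., K.
    """
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels - 1] = 1.0
    return T

T = one_hot([2, 5, 1], K=5)
# The first row is (0, 1, 0, 0, 0), the target vector for a point from class 2.
```

Each row contains a single 1, so the rows can be read as degenerate probability distributions over the K classes.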


5.1.4 Least squares for classification 


With linear regression models, the minimization of a sum-of-squares error func- 
tion leads to a simple closed-form solution for the parameter values. It is therefore 
tempting to see if we can apply the same least-squares formalism to classification 
problems. Consider a general classification problem with K classes and a 1-of-K 
binary coding scheme for the target vector t. One justification for using least squares 
in such a context is that it approximates the conditional expectation $\mathbb{E}[\mathbf{t}|\mathbf{x}]$ of the
target values given the input vector. For a binary coding scheme, this conditional
expectation is given by the vector of posterior class probabilities. Unfortunately, these
probabilities are typically approximated rather poorly, and indeed the approximations
can have values outside the range (0, 1). However, it is instructive to explore
these simple models and to understand how these limitations arise.

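Each class $\mathcal{C}_k$ is described by its own linear model so that

$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}$$ (5.12)

where k = 1,..., K. We can conveniently group these together using vector notation
so that

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\tilde{\mathbf{x}}$$ (5.13)

where $\widetilde{\mathbf{W}}$ is a matrix whose kth column comprises the (D + 1)-dimensional vector
$\tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^{\mathrm{T}})^{\mathrm{T}}$ and $\tilde{\mathbf{x}}$ is the corresponding augmented input vector $(1, \mathbf{x}^{\mathrm{T}})^{\mathrm{T}}$ with
a dummy input $x_0 = 1$. A new input x is then assigned to the class for which the
output $y_k = \tilde{\mathbf{w}}_k^{\mathrm{T}}\tilde{\mathbf{x}}$ is largest.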

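We now determine the parameter matrix $\widetilde{\mathbf{W}}$ by minimizing a sum-of-squares
error function. Consider a training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ where n = 1,..., N, and
define a matrix T whose nth row is the vector $\mathbf{t}_n^{\mathrm{T}}$, together with a matrix $\widetilde{\mathbf{X}}$ whose
nth row is $\tilde{\mathbf{x}}_n^{\mathrm{T}}$. The sum-of-squares error function can then be written as

$$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2}\mathrm{Tr}\left\{(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})^{\mathrm{T}}(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})\right\}.$$ (5.14)

Setting the derivative with respect to $\widetilde{\mathbf{W}}$ to zero and rearranging, we obtain the
solution for $\widetilde{\mathbf{W}}$ in the form

$$\widetilde{\mathbf{W}} = (\widetilde{\mathbf{X}}^{\mathrm{T}}\widetilde{\mathbf{X}})^{-1}\widetilde{\mathbf{X}}^{\mathrm{T}}\mathbf{T} = \widetilde{\mathbf{X}}^{\dagger}\mathbf{T}$$ (5.15)

where $\widetilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of the matrix $\widetilde{\mathbf{X}}$. We then obtain the discriminant
function in the form

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\tilde{\mathbf{x}} = \mathbf{T}^{\mathrm{T}}\left(\widetilde{\mathbf{X}}^{\dagger}\right)^{\mathrm{T}}\tilde{\mathbf{x}}.$$ (5.16)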


An interesting property of least-squares solutions with multiple target variables 
is that if every target vector in the training set satisfies some linear constraint 


$$\mathbf{a}^{\mathrm{T}}\mathbf{t}_n + b = 0$$ (5.17)


for some constants a and b, then the model prediction for any value of x will satisfy
the same constraint so that

$$\mathbf{a}^{\mathrm{T}}\mathbf{y}(\mathbf{x}) + b = 0.$$ (5.18)


Thus, if we use a 1-of-K coding scheme for K classes, then the predictions made
by the model will have the property that the elements of y(x) will sum to 1 for any 
value of x. However, this summation constraint alone is not sufficient to allow the 
model outputs to be interpreted as probabilities because they are not constrained to 
lie within the interval (0, 1). 
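
Both the closed-form solution (5.15) and the summation property can be seen in a small experiment. The sketch below is an illustration on synthetic data, not a reproduction of any figure: the three class centres, the noise level, and the range of test inputs are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_per_class = 3, 50
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # illustrative class centres
X = np.vstack([rng.normal(m, 1.0, size=(n_per_class, 2)) for m in means])
labels = np.repeat(np.arange(K), n_per_class)

T = np.eye(K)[labels]                                # 1-of-K targets, one row per point
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented inputs with dummy x_0 = 1
W_tilde = np.linalg.pinv(X_tilde) @ T                # least-squares solution, as in (5.15)

def predict(x_new):
    """Return y(x) for each row of x_new using the augmented notation of (5.13)."""
    x_aug = np.hstack([np.ones((x_new.shape[0], 1)), x_new])
    return x_aug @ W_tilde

Y = predict(rng.uniform(-10.0, 10.0, size=(200, 2)))
sums = Y.sum(axis=1)                                 # each row sums to 1 up to rounding
outside = bool(np.any((Y < 0.0) | (Y > 1.0)))        # some outputs leave the interval (0, 1)
```

The rows of `Y` sum to one, which is (5.18) with a = (1,..., 1) and b = −1, yet individual outputs fall outside (0, 1) for inputs far from the training data, so they cannot be read as probabilities.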

The least-squares approach gives an exact closed-form solution for the discrim- 
inant function parameters. However, even as a discriminant function (where we use 
it to make decisions directly and dispense with any probabilistic interpretation), it 
suffers from some severe problems. We have seen that the sum-of-squares error 
function can be viewed as the negative log likelihood under the assumption of a 
Gaussian noise distribution. If the true distribution of the data is markedly different 
from being Gaussian, then least squares can give poor results. In particular, least 
squares is very sensitive to the presence of outliers, which are data points located a 
long way from the bulk of the data. This is illustrated in Figure 5.4. Here we see that 
the additional data points in the right-hand figure produce a significant change in the 
location of the decision boundary, even though these points would be correctly clas- 
sified by the original decision boundary in the left-hand figure. The sum-of-squares 
error function gives too much weight to data points that are a long way from the 
decision boundary, even though they are correctly classified. Outliers can arise due 
to rare events or may simply be due to mistakes in the data set. Techniques that are 
sensitive to a very few data points are said to lack robustness. For comparison, Fig- 
ure 5.4 also shows results from a technique called logistic regression, which is more 
robust to outliers. 

The failure of least squares should not surprise us when we recall that it cor- 
responds to maximum likelihood under the assumption of a Gaussian conditional 
distribution, whereas binary target vectors clearly have a distribution that is far from 
Gaussian. By adopting more appropriate probabilistic models, we can obtain clas- 
sification techniques with much better properties than least squares, and which can 
also be generalized to give flexible nonlinear neural network models, as we will see 
in later chapters. 
Figure 5.4 The left-hand plot shows data from two classes, denoted by red crosses and blue circles, together
with the decision boundaries found by least squares (magenta curve) and by a logistic regression model (green 
curve). The right-hand plot shows the corresponding results obtained when extra data points are added at the 
bottom right of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression. 


5.2. Decision Theory


When we discussed linear regression (Section 4.2), we saw how the process of making predictions
in machine learning can be broken down into the two stages of inference and de- 
cision. We now explore this perspective in much greater depth specifically in the 
context of classifiers. 

Suppose we have an input vector x together with a corresponding vector t of 
target variables, and our goal is to predict t given a new value for x. For regression 
problems, t will comprise continuous variables and in general will be a vector as 
we may wish to predict several related quantities. For classification problems, t will 
represent class labels. Again, t will in general be a vector if we have more than 
two classes. The joint probability distribution p(x, t) provides a complete summary 
of the uncertainty associated with these variables. Determining p(x, t) from a set 
of training data is an example of inference and is typically a very difficult problem 
whose solution forms the subject of much of this book. In a practical application, 
however, we must often make a specific prediction for the value of t or more gen- 
erally take a specific action based on our understanding of the values t is likely to 
take, and this aspect is the subject of decision theory. 

Consider, for example, our earlier medical diagnosis problem in which we have 
taken an image of a skin lesion on a patient, and we wish to determine whether the 
patient has cancer. In this case, the input vector x is the set of pixel intensities in 
the image, and the output variable t will represent the absence of cancer, which we
denote by the class $\mathcal{C}_1$, or the presence of cancer, which we denote by the class $\mathcal{C}_2$.
We might, for instance, choose t to be a binary variable such that t = 0 corresponds
to class $\mathcal{C}_1$ and t = 1 corresponds to class $\mathcal{C}_2$. We will see later that this choice of
label values is particularly convenient when working with probabilities. The general
inference problem then involves determining the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$, or
equivalently p(x, t), which gives us the most complete probabilistic description of
the variables. Although this can be a very useful and informative quantity, ultimately, 
we must decide either to give treatment to the patient or not, and we would like this 
choice to be optimal according to some appropriate criterion (Duda and Hart, 1973). 
This is the decision step, and the aim of decision theory is that it should tell us how 
to make optimal decisions given the appropriate probabilities. We will see that the 
decision stage is generally very simple, even trivial, once we have solved the in- 
ference problem. Here we give an introduction to the key ideas of decision theory 
as required for the rest of the book. Further background, as well as more detailed 
accounts, can be found in Berger (1985) and Bather (2000). 

Before giving a more detailed analysis, let us first consider informally how we 
might expect probabilities to play a role in making decisions. When we obtain the 
skin image x for a new patient, our goal is to decide which of the two classes to assign 
the image to. We are therefore interested in the probabilities of the two classes, given 
the image, which are given by $p(\mathcal{C}_k|\mathbf{x})$. Using Bayes’ theorem, these probabilities
can be expressed in the form

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(\mathbf{x})}.$$ (5.19)

Note that any of the quantities appearing in Bayes’ theorem can be obtained from
the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ by either marginalizing or conditioning with respect to
the appropriate variables. We can now interpret $p(\mathcal{C}_k)$ as the prior probability for the
class $\mathcal{C}_k$ and $p(\mathcal{C}_k|\mathbf{x})$ as the corresponding posterior probability. Thus, $p(\mathcal{C}_2)$
represents the probability that a person has cancer, before the image is taken. Similarly,
$p(\mathcal{C}_2|\mathbf{x})$ is the posterior probability, revised using Bayes’ theorem in light of the
information contained in the image. If our aim is to minimize the chance of assigning
x to the wrong class, then intuitively we would choose the class having the higher 
posterior probability. We now show that this intuition is correct, and we also discuss 
more general criteria for making decisions. 
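
To make the role of the prior concrete, the sketch below evaluates Bayes’ theorem (5.19) for a single image. All of the numbers are hypothetical and chosen only for illustration: a prior of 0.01 for cancer and assumed values of the class-conditional densities at the observed x.

```python
import numpy as np

def posterior(likelihoods, priors):
    """Posterior p(C_k|x) from values of p(x|C_k) and priors p(C_k)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)   # p(x, C_k) = p(x|C_k) p(C_k)
    return joint / joint.sum()                             # divide by p(x) = sum_k p(x, C_k)

# Index 0 is C_1 (no cancer) and index 1 is C_2 (cancer); values are hypothetical.
priors = np.array([0.99, 0.01])
likelihoods = np.array([0.05, 0.60])    # assumed p(x|C_1) and p(x|C_2) for one image x
post = posterior(likelihoods, priors)
```

Although the image is twelve times more likely under the cancer class, the small prior keeps the posterior probability of cancer at roughly 0.11, so the minimum-misclassification rule would still choose the first class.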


5.2.1 Misclassification rate 


Suppose that our goal is simply to make as few misclassifications as possible. 
We need a rule that assigns each value of x to one of the available classes. Such a 
rule will divide the input space into regions $\mathcal{R}_k$ called decision regions, one for each
class, such that all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$. The boundaries between
decision regions are called decision boundaries or decision surfaces. Note that each
decision region need not be contiguous but could comprise some number of disjoint
regions. To find the optimal decision rule, consider first the case of two classes, as in
the cancer problem, for instance. A mistake occurs when an input vector belonging
to class $\mathcal{C}_1$ is assigned to class $\mathcal{C}_2$ or vice versa. The probability of this occurring is
given by

$$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\,\mathrm{d}\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\,\mathrm{d}\mathbf{x}.$$ (5.20)

We are free to choose the decision rule that assigns each point x to one of the
two classes. Clearly, to minimize p(mistake) we should arrange that each x is assigned
to whichever class gives the smaller value of the integrand in (5.20). Thus, if
$p(\mathbf{x}, \mathcal{C}_1) > p(\mathbf{x}, \mathcal{C}_2)$ for a given value of x, then we should assign that x to class
$\mathcal{C}_1$. From the product rule of probability, we have $p(\mathbf{x}, \mathcal{C}_k) = p(\mathcal{C}_k|\mathbf{x})p(\mathbf{x})$. Because
the factor p(x) is common to both terms, we can restate this result as saying that
the minimum probability of making a mistake is obtained if each value of x is assigned
to the class for which the posterior probability $p(\mathcal{C}_k|\mathbf{x})$ is largest. This result
is illustrated for two classes and a single input variable x in Figure 5.5.

For the more general case of K classes, it is slightly easier to maximize the 
probability of being correct, which is given by 


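$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^{K}\int_{\mathcal{R}_k} p(\mathbf{x}, \mathcal{C}_k)\,\mathrm{d}\mathbf{x},$$ (5.21)

which is maximized when the regions $\mathcal{R}_k$ are chosen such that each x is assigned
to the class for which $p(\mathbf{x}, \mathcal{C}_k)$ is largest. Again, using the product rule $p(\mathbf{x}, \mathcal{C}_k) =
p(\mathcal{C}_k|\mathbf{x})p(\mathbf{x})$, and noting that the factor of p(x) is common to all terms, we see
that each x should be assigned to the class having the largest posterior probability
$p(\mathcal{C}_k|\mathbf{x})$.

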
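
The two-class result can be verified numerically for a single input variable. In the sketch below, the joint densities are assumed to be scaled unit-variance Gaussians (priors 0.6 and 0.4, means 0 and 2), the decision regions are taken to be the two sides of a threshold, and (5.20) is approximated by a sum over a fine grid.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-6.0, 8.0, 14001)
dx = x[1] - x[0]
joint1 = 0.6 * gauss(x, 0.0, 1.0)     # p(x, C_1) = p(x|C_1) p(C_1), assumed values
joint2 = 0.4 * gauss(x, 2.0, 1.0)     # p(x, C_2) = p(x|C_2) p(C_2), assumed values

def p_mistake(threshold):
    """Approximate p(mistake) with R_1 = {x < threshold} and R_2 = {x >= threshold}."""
    in_r1 = x < threshold
    return (joint2[in_r1].sum() + joint1[~in_r1].sum()) * dx

thresholds = np.linspace(-2.0, 4.0, 601)
errors = np.array([p_mistake(th) for th in thresholds])
best = thresholds[errors.argmin()]

# The joint densities cross where 0.6 N(x|0,1) = 0.4 N(x|2,1), that is x = 1 + ln(1.5)/2.
crossing = 1.0 + 0.5 * np.log(1.5)
```

The threshold with the smallest estimated error lies at the point where the two joint densities cross, which is where the larger posterior probability switches from one class to the other.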
5.2.2 Expected loss 


For many applications, our objective will be more complex than simply mini- 
mizing the number of misclassifications. Let us consider again the medical diagnosis 
problem. We note that, if a patient who does not have cancer is incorrectly diagnosed 
as having cancer, the consequences may be that they experience some distress plus 
there is the need for further investigations. Conversely, if a patient with cancer is 
diagnosed as healthy, the result may be premature death due to lack of treatment. 
Thus, the consequences of these two types of mistake can be dramatically different. 
It would clearly be better to make fewer mistakes of the second kind, even if this was 
at the expense of making more mistakes of the first kind. 

We can formalize such issues through the introduction of a loss function, also 
called a cost function, which is a single, overall measure of loss incurred in taking 
any of the available decisions or actions. Our goal is then to minimize the total loss 
incurred. Note that some authors consider instead a utility function, whose value 
they aim to maximize. These are equivalent concepts if we take the utility to be
simply the negative of the loss. Throughout this text we will use the loss function
convention. Suppose that, for a new value of x, the true class is $\mathcal{C}_k$ and that we assign
x to class $\mathcal{C}_j$ (where j may or may not be equal to k). In so doing, we incur some
level of loss that we denote by $L_{kj}$, which we can view as the k, j element of a loss
matrix. For instance, in our cancer example, we might have a loss matrix of the form
shown in Figure 5.6. This particular loss matrix says that there is no loss incurred if
the correct decision is made, there is a loss of 1 if a healthy patient is diagnosed as
having cancer, whereas there is a loss of 100 if a patient having cancer is diagnosed
as healthy.

Figure 5.5 Schematic illustration of the joint probabilities $p(x, \mathcal{C}_k)$ for each of two classes plotted against x,
together with the decision boundary $x = \hat{x}$. Values of $x > \hat{x}$ are classified as class $\mathcal{C}_2$ and hence belong to
decision region $\mathcal{R}_2$, whereas points $x < \hat{x}$ are classified as $\mathcal{C}_1$ and belong to $\mathcal{R}_1$. Errors arise from the blue,
green, and red regions, so that for $x < \hat{x}$, the errors are due to points from class $\mathcal{C}_2$ being misclassified as $\mathcal{C}_1$
(represented by the sum of the red and green regions). Conversely for points in the region $x > \hat{x}$, the errors are
due to points from class $\mathcal{C}_1$ being misclassified as $\mathcal{C}_2$ (represented by the blue region). By varying the location
$\hat{x}$ of the decision boundary, as indicated by the red double-headed arrow in (a), the combined areas of the blue
and green regions remain constant, whereas the size of the red region varies. The optimal choice for $\hat{x}$ is where
the curves for $p(x, \mathcal{C}_1)$ and $p(x, \mathcal{C}_2)$ cross, as shown in (b) and corresponding to $\hat{x} = x_0$, because in this case
the red region disappears. This is equivalent to the minimum misclassification rate decision rule, which assigns
each value of x to the class having the higher posterior probability $p(\mathcal{C}_k|x)$.

Figure 5.6 An example of a loss matrix with elements $L_{kj}$ for the cancer treatment problem. The rows
correspond to the true class, whereas the columns correspond to the assignment of class made by our decision
criterion.

              normal   cancer
    normal       0        1
    cancer     100        0

The optimal solution is the one that minimizes the loss function. However, the 
loss function depends on the true class, which is unknown. For a given input vector x, 
our uncertainty in the true class is expressed through the joint probability distribution 
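$p(\mathbf{x}, \mathcal{C}_k)$, and so we seek instead to minimize the average loss, where the average is
computed with respect to this distribution and is given by

$$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, \mathrm{d}\mathbf{x}. \tag{5.22}$$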


Each x can be assigned independently to one of the decision regions $\mathcal{R}_j$. Our goal
is to choose the regions $\mathcal{R}_j$ to minimize the expected loss (5.22), which implies that
for each x, we should minimize $\sum_k L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)$. As before, we can use the product
rule $p(\mathbf{x}, \mathcal{C}_k) = p(\mathcal{C}_k|\mathbf{x})p(\mathbf{x})$ to eliminate the common factor of p(x). Thus, the
decision rule that minimizes the expected loss assigns each new x to the class $j$ for
which the quantity

$$\sum_k L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) \tag{5.23}$$

is a minimum. Once we have chosen values for the loss matrix elements $L_{kj}$, this is
clearly trivial to do.
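
The decision rule (5.23) can be illustrated with a short Python sketch. It assumes the loss matrix of Figure 5.6 with class index 0 for normal and 1 for cancer; the function name and the posterior values are hypothetical and chosen only for illustration.

```python
# Loss matrix of Figure 5.6: rows are the true class, columns the assigned class.
# Index 0 = normal, index 1 = cancer (an indexing convention assumed here).
LOSS = [[0, 1],
        [100, 0]]


def min_expected_loss_class(posteriors, loss=LOSS):
    """Return (j, expected) where j minimizes sum_k L[k][j] * p(C_k | x)."""
    num_classes = len(loss[0])
    expected = [
        sum(loss[k][j] * posteriors[k] for k in range(len(posteriors)))
        for j in range(num_classes)
    ]
    best = min(range(num_classes), key=lambda j: expected[j])
    return best, expected


# Hypothetical posterior: 95% normal, 5% cancer.
decision, expected = min_expected_loss_class([0.95, 0.05])
```

Here the expected loss of assigning "normal" is 100 × 0.05 = 5, whereas assigning "cancer" costs 1 × 0.95 = 0.95, so the rule chooses cancer even though normal is the more probable class.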


5.2.3 The reject option 


We have seen that classification errors arise from the regions of input space
where the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is significantly less than unity
or equivalently where the joint distributions $p(\mathbf{x}, \mathcal{C}_k)$ have comparable values. These
are the regions where we are relatively uncertain about class membership. In some
applications, it will be appropriate to avoid making decisions on the difficult cases
in anticipation of obtaining a lower error rate on those examples for which a
classification decision is made. This is known as the reject option. For example, in our
hypothetical cancer screening example, it may be appropriate to use an automatic
system to classify those images for which there is little doubt as to the correct class,
while requesting a biopsy to classify the more ambiguous cases. We can achieve this
by introducing a threshold $\theta$ and rejecting those inputs x for which the largest of
the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is less than or equal to $\theta$. This is illustrated for
two classes and a single continuous input variable x in Figure 5.7. Note that setting
$\theta = 1$ will ensure that all examples are rejected, whereas if there are K classes, then
setting $\theta < 1/K$ will ensure that no examples are rejected. Thus, the fraction of
examples that are rejected is controlled by the value of $\theta$.

Figure 5.7 Illustration of the reject option. Inputs x such that the larger of the two posterior
probabilities is less than or equal to some threshold $\theta$ will be rejected.

We can easily extend the reject criterion to minimize the expected loss, when a 
loss matrix is given, by taking account of the loss incurred when a reject decision is 
made. 
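
As a minimal sketch of the threshold rule described above, the following Python function returns a class index or rejects the input. The function name and the use of `None` to signal rejection are assumptions made for this illustration.

```python
def classify_with_reject(posteriors, theta):
    """Return the index of the most probable class, or None to reject.

    An input is rejected when its largest posterior probability is less
    than or equal to the threshold theta.
    """
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if posteriors[best] <= theta:
        return None
    return best


ambiguous = classify_with_reject([0.55, 0.45], theta=0.8)   # rejected
confident = classify_with_reject([0.90, 0.10], theta=0.8)   # class 0
```

With theta = 1 every input is rejected, because no posterior can exceed one, and with theta below 1/K none is rejected, because the largest of K posteriors is at least 1/K.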


5.2.4 Inference and decision 


We have broken the classification problem down into two separate stages, the 
inference stage in which we use training data to learn a model for $p(\mathcal{C}_k|\mathbf{x})$ and the
subsequent decision stage in which we use these posterior probabilities to make op- 
timal class assignments. An alternative possibility would be to solve both problems 
together and simply learn a function that maps inputs x directly into decisions. Such 
a function is called a discriminant function. 

In fact, we can identify three distinct approaches to solving decision problems, 
all of which have been used in practical applications. These are, in decreasing order 
of complexity, as follows: 


(a) First, solve the inference problem of determining the class-conditional
densities $p(\mathbf{x}|\mathcal{C}_k)$ for each class $\mathcal{C}_k$ individually. Separately infer the prior class
probabilities $p(\mathcal{C}_k)$. Then use Bayes’ theorem in the form

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} \tag{5.24}$$

to find the posterior class probabilities $p(\mathcal{C}_k|\mathbf{x})$. As usual, the denominator in
Bayes’ theorem can be found in terms of the quantities in the numerator, using

$$p(\mathbf{x}) = \sum_k p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k). \tag{5.25}$$

Equivalently, we can model the joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ directly and then
normalize to obtain the posterior probabilities. Having found the posterior
probabilities, we use decision theory to determine the class membership for
each new input x. Approaches that explicitly or implicitly model the
distribution of inputs as well as outputs are known as generative models, because by
sampling from them, it is possible to generate synthetic data points in the input
space.


(b) First, solve the inference problem of determining the posterior class
probabilities $p(\mathcal{C}_k|\mathbf{x})$, and then subsequently use decision theory to assign each new x to
one of the classes. Approaches that model the posterior probabilities directly
are called discriminative models.


(c) Find a function f(x), called a discriminant function, that maps each input x
directly onto a class label. For instance, for two-class problems, f(·) might be
binary valued and such that f = 0 represents class $\mathcal{C}_1$ and f = 1 represents
class $\mathcal{C}_2$. In this case, probabilities play no role.


Let us consider the relative merits of these three alternatives. Approach (a) is the 
most demanding because it involves finding the joint distribution over both x and
$\mathcal{C}_k$. For many applications, x will have high dimensionality, and consequently, we
may need a large training set to be able to determine the class-conditional densities to
reasonable accuracy. Note that the class priors $p(\mathcal{C}_k)$ can often be estimated simply
from the fractions of the training set data points in each of the classes. One advantage 
of approach (a), however, is that it also allows the marginal density of data p(x) to 
be determined from (5.25). This can be useful for detecting new data points that 
have low probability under the model and for which the predictions may be of low 
accuracy, which is known as outlier detection or novelty detection (Bishop, 1994; 
Tarassenko, 1995). 

However, if we wish only to make classification decisions, then it can be wasteful
of computational resources and excessively demanding of data to find the joint
distribution $p(\mathbf{x}, \mathcal{C}_k)$ when in fact we really need only the posterior probabilities
$p(\mathcal{C}_k|\mathbf{x})$, which can be obtained directly through approach (b). Indeed, the
class-conditional densities may contain a significant amount of structure that has little
effect on the posterior probabilities, as illustrated in Figure 5.8. There has been much
interest in exploring the relative merits of generative and discriminative approaches 
to machine learning and in finding ways to combine them (Jebara, 2004; Lasserre, 
Bishop, and Minka, 2006). 

An even simpler approach is (c) in which we use the training data to find a 
discriminant function f(x) that maps each x directly onto a class label, thereby 
combining the inference and decision stages into a single learning problem. In the 
example of Figure 5.8, this would correspond to finding the value of x shown by
the vertical green line, because this is the decision boundary giving the minimum
probability of misclassification.

Figure 5.8 Example of the class-conditional densities for two classes having a single input variable x (left
plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the
class-conditional density $p(x|\mathcal{C}_1)$, shown in blue on the left plot, has no effect on the posterior probabilities. The
vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification
rate, assuming the prior class probabilities, $p(\mathcal{C}_1)$ and $p(\mathcal{C}_2)$, are equal.

With option (c), however, we no longer have access to the posterior probabilities 
$p(\mathcal{C}_k|\mathbf{x})$. There are many powerful reasons for wanting to compute the posterior
probabilities, even if we subsequently use them to make decisions. These include: 


Minimizing risk. Consider a problem in which the elements of the loss matrix are 
subjected to revision from time to time (such as might occur in a financial 
application). If we know the posterior probabilities, we can trivially revise the 
minimum risk decision criterion by modifying (5.23) appropriately. If we have 
only a discriminant function, then any change to the loss matrix would require 
that we return to the training data and solve the inference problem afresh. 


Reject option. Posterior probabilities allow us to determine a rejection criterion that 
will minimize the misclassification rate, or more generally the expected loss, 
for a given fraction of rejected data points. 


Compensating for class priors. Consider our cancer screening example again (Section 2.1.1), and
suppose that we have collected a large number of images from the general
population for use as training data, which we use to build an automated screening
system. Because cancer is rare amongst the general population, we might find
that, say, only 1 in every 1,000 examples corresponds to the presence of cancer.
If we used such a data set to train an adaptive model, we could run into severe
difficulties due to the small proportion of those in the cancer class. For in- 
stance, a classifier that assigned every point to the normal class would achieve 
99.9% accuracy, and it may be difficult to avoid this trivial solution. Also, even 
a large data set will contain very few examples of skin images corresponding 
to cancer, and so the learning algorithm will not be exposed to a broad range 
of examples of such images and hence is not likely to generalize well. A bal- 
anced data set with equal numbers of examples from each of the classes would 
allow us to find a more accurate model. However, we then have to compensate 
for the effects of our modifications to the training data. Suppose we have used 
such a modified data set and found models for the posterior probabilities. From 
Bayes’ theorem (5.24), we see that the posterior probabilities are proportional 
to the prior probabilities, which we can interpret as the fractions of points in 
each class. We can therefore simply take the posterior probabilities obtained 
from our artificially balanced data set, divide by the class fractions in that data 
set, and then multiply by the class fractions in the population to which we wish 
to apply the model. Finally, we need to normalize to ensure that the new poste- 
rior probabilities sum to one. Note that this procedure cannot be applied if we 
have learned a discriminant function directly instead of determining posterior 
probabilities. 


Combining models. For complex applications, we may wish to break the problem
into a number of smaller sub-problems each of which can be tackled by a
separate module. For example, in our hypothetical medical diagnosis problem,
we may have information available from, say, blood tests as well as skin im- 
ages. Rather than combine all of this heterogeneous information into one huge 
input space, it may be more effective to build one system to interpret the im- 
ages and a different one to interpret the blood data. If each of the two models 
gives posterior probabilities for the classes, then we can combine the outputs 
systematically using the rules of probability. One simple way to do this is to 
assume that, for each class separately, the distributions of inputs for the
images, denoted by $\mathbf{x}_{\mathrm{I}}$, and the blood data, denoted by $\mathbf{x}_{\mathrm{B}}$, are independent, so
that

$$p(\mathbf{x}_{\mathrm{I}}, \mathbf{x}_{\mathrm{B}}|\mathcal{C}_k) = p(\mathbf{x}_{\mathrm{I}}|\mathcal{C}_k)\, p(\mathbf{x}_{\mathrm{B}}|\mathcal{C}_k). \tag{5.26}$$

This is an example of a conditional independence property, because the
independence holds when the distribution is conditioned on the class $\mathcal{C}_k$. The
posterior probability, given both the image and blood data, is then given by

$$\begin{aligned}
p(\mathcal{C}_k|\mathbf{x}_{\mathrm{I}}, \mathbf{x}_{\mathrm{B}}) &\propto p(\mathbf{x}_{\mathrm{I}}, \mathbf{x}_{\mathrm{B}}|\mathcal{C}_k)\, p(\mathcal{C}_k) \\
&\propto p(\mathbf{x}_{\mathrm{I}}|\mathcal{C}_k)\, p(\mathbf{x}_{\mathrm{B}}|\mathcal{C}_k)\, p(\mathcal{C}_k) \\
&\propto \frac{p(\mathcal{C}_k|\mathbf{x}_{\mathrm{I}})\, p(\mathcal{C}_k|\mathbf{x}_{\mathrm{B}})}{p(\mathcal{C}_k)}.
\end{aligned} \tag{5.27}$$

Thus, we need the class prior probabilities $p(\mathcal{C}_k)$, which we can easily estimate
from the fractions of data points in each class, and then we need to normalize
the resulting posterior probabilities so they sum to one. The particular
conditional independence assumption (5.26) is an example of a naive Bayes model.
Note that the joint marginal distribution $p(\mathbf{x}_{\mathrm{I}}, \mathbf{x}_{\mathrm{B}})$ will typically not factorize
under this model. We will see in later chapters how to construct models for
combining data that do not require the conditional independence assumption
(5.26). A further advantage of using models that output probabilities rather
than decisions is that they can easily be made differentiable with respect to
any adjustable parameters (such as the weight coefficients in the polynomial
regression example), which allows them to be composed and trained jointly
using gradient-based optimization methods.

Figure 5.9 The confusion matrix for the cancer treatment problem, in which the rows correspond to the true
class and the columns correspond to the assignment of class made by our decision criterion. The elements
of the matrix show the numbers of true negatives, false positives, false negatives, and true positives.

            normal            cancer
  normal    $N_{\mathrm{TN}}$    $N_{\mathrm{FP}}$
  cancer    $N_{\mathrm{FN}}$    $N_{\mathrm{TP}}$
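
Two of the uses of posterior probabilities listed above, compensating for class priors and combining models through (5.27), reduce to a rescaling followed by a normalization. The Python sketch below illustrates both; the function names and all numerical values are hypothetical, and the combination step relies on the naive Bayes assumption (5.26).

```python
def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def correct_for_priors(posteriors, train_priors, target_priors):
    """Divide by the training class fractions, multiply by the target ones."""
    rescaled = [p / q_train * q_target
                for p, q_train, q_target
                in zip(posteriors, train_priors, target_priors)]
    return normalize(rescaled)


def combine_posteriors(post_image, post_blood, priors):
    """Combine two posteriors as in (5.27): p(C|x_I) p(C|x_B) / p(C)."""
    unnormalized = [p_img * p_bld / prior
                    for p_img, p_bld, prior
                    in zip(post_image, post_blood, priors)]
    return normalize(unnormalized)


# Model trained on a balanced set, applied where 1 in 1,000 has cancer.
corrected = correct_for_priors([0.2, 0.8], [0.5, 0.5], [0.999, 0.001])

# Two hypothetical models, one per data source, with equal class priors.
combined = combine_posteriors([0.3, 0.7], [0.4, 0.6], [0.5, 0.5])
```

In the first example a posterior of 0.8 for cancer under the balanced training set drops to roughly 0.004 once the population prior is restored, which shows how strongly the correction can change the output.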


5.2.5 Classifier accuracy 


The simplest measure of performance for a classifier is the fraction of test set 
points that are correctly classified. However, we have seen that different types of 
error can have different consequences, as expressed through the loss matrix, and 
often we therefore do not simply wish to minimize the number of misclassifications. 
By changing the location of the decision boundary, we can make trade-offs between 
different kinds of error, for example with the goal of minimizing an expected loss. 
Because this is such an important concept, we will introduce some definitions and 
terminology so that the performance of a classifier can be better characterized. 

We will consider again our cancer screening example. For each person tested, 
there is a ‘true label’ of whether they have cancer or not, and there is also the predic- 
tion made by the classifier. If, for a particular person, the classifier predicts cancer 
and this is in fact the true label, then the prediction is called a true positive. How- 
ever, if the person does not have cancer it is a false positive. Likewise, if the classifier 
predicts that a person does not have cancer and this is correct, then the prediction is 
called a true negative, otherwise it is a false negative. The false positives are also 
known as type 1 errors whereas the false negatives are called type 2 errors. If N is
the total number of people taking the test, then $N_{\mathrm{TP}}$ is the number of true positives,
$N_{\mathrm{FP}}$ is the number of false positives, $N_{\mathrm{TN}}$ is the number of true negatives, and $N_{\mathrm{FN}}$
is the number of false negatives, where

$$N = N_{\mathrm{TP}} + N_{\mathrm{FP}} + N_{\mathrm{TN}} + N_{\mathrm{FN}}. \tag{5.28}$$


This can be represented as a confusion matrix as shown in Figure 5.9. Accuracy, 
measured by the fraction of correct classifications, is then given by 


$$\text{Accuracy} = \frac{N_{\mathrm{TP}} + N_{\mathrm{TN}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}} + N_{\mathrm{TN}} + N_{\mathrm{FN}}}. \tag{5.29}$$

We can see that accuracy can be misleading if there are strongly imbalanced classes.
In our cancer screening example, for instance, where only 1 person in 1,000 has 
cancer, a naive classifier that simply decides that nobody has cancer will achieve 
99.9% accuracy and yet is completely useless. 

Several other quantities can be defined in terms of these numbers, of which the 
most commonly encountered are 


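$$\text{Precision} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} \tag{5.30}$$

$$\text{Recall} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}} \tag{5.31}$$

$$\text{False positive rate} = \frac{N_{\mathrm{FP}}}{N_{\mathrm{FP}} + N_{\mathrm{TN}}} \tag{5.32}$$

$$\text{False discovery rate} = \frac{N_{\mathrm{FP}}}{N_{\mathrm{FP}} + N_{\mathrm{TP}}} \tag{5.33}$$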


In our cancer screening example, precision represents an estimate of the probability 
that a person who has a positive test does indeed have cancer, whereas recall is an 
estimate of the probability that a person who has cancer is correctly detected by 
the test. The false positive rate is an estimate of the probability that a person who is 
normal will be classified as having cancer, whereas the false discovery rate represents 
the fraction of those testing positive who do not in fact have cancer. 
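
The quantities (5.29) to (5.33) follow directly from the four confusion-matrix counts. The sketch below computes them for a hypothetical screening of 100,000 people, 100 of whom have cancer; the counts and the function name are invented for illustration.

```python
def classification_metrics(n_tp, n_fp, n_tn, n_fn):
    """Compute accuracy, precision, recall, FPR, and FDR from the counts.

    Precision and false discovery rate are undefined (division by zero)
    when the classifier makes no positive predictions at all.
    """
    return {
        "accuracy": (n_tp + n_tn) / (n_tp + n_fp + n_tn + n_fn),
        "precision": n_tp / (n_tp + n_fp),
        "recall": n_tp / (n_tp + n_fn),
        "false_positive_rate": n_fp / (n_fp + n_tn),
        "false_discovery_rate": n_fp / (n_fp + n_tp),
    }


# Hypothetical counts: 90 of 100 cancers detected, 999 of 99,900 false alarms.
metrics = classification_metrics(n_tp=90, n_fp=999, n_tn=98901, n_fn=10)
```

With these counts the recall is 0.9 and the false positive rate only 0.01, yet the precision is about 0.08, because the normal class is so much larger that even a small false positive rate produces far more false positives than true positives.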

By altering the location of the decision boundary, we can change the trade-offs 
between the two kinds of errors. To understand this trade-off, we revisit Figure 5.5, 
but now we label the various regions as shown in Figure 5.10. We can relate the 
labelled regions to the various true and false rates as follows: 


$$N_{\mathrm{FP}}/N = E \tag{5.34}$$

$$N_{\mathrm{TP}}/N = D + E \tag{5.35}$$

$$N_{\mathrm{FN}}/N = B + C \tag{5.36}$$

$$N_{\mathrm{TN}}/N = A + C \tag{5.37}$$

where we are implicitly considering the limit $N \to \infty$ so that we can relate number
of observations to probabilities.


5.2.6 ROC curve 


A probabilistic classifier will output a posterior probability, which can be con- 
verted to a decision by setting a threshold. As the value of the threshold is varied, we 
can reduce type 1 errors at the expense of increasing type 2 errors, or vice versa. To
better understand this trade-off, it is useful to plot the receiver operating characteris- 
tic or ROC curve (Fawcett, 2006), a name that originates from procedures to measure 
the performance of radar receivers. This is a graph of true positive rate versus false 
positive rate, as shown in Figure 5.11. 

Figure 5.10 As in Figure 5.5, with the various regions labelled. In the cancer classification problem, region $\mathcal{R}_1$
is assigned to the normal class whereas region $\mathcal{R}_2$ is assigned to the cancer class.

As the decision boundary in Figure 5.10 is moved from $-\infty$ to $\infty$, the ROC
curve is traced out and can then be generated by plotting the cumulative fraction of
correct detection of cancer on the y-axis versus the cumulative fraction of incorrect
detection on the x-axis. Note that a specific confusion matrix represents one point 
along the ROC curve. The best possible classifier would be represented by a point at 
the top left corner of the ROC diagram. The bottom left corner represents a simple 
classifier that assigns every point to the normal class and therefore has no true posi- 
tives but also no false positives. Similarly, the top right corner represents a classifier 
that assigns everything to the cancer class and therefore has no false negatives but 
also no true negatives. In Figure 5.11, the classifiers represented by the blue curve 
are better than those of the red curve for any choice of, say, false positive rate. It 
is also possible, however, for such curves to cross over, in which case the choice of 
which is better will depend on the choice of operating point. 

As a baseline, we can consider a random classifier that simply assigns each data 
point to cancer with probability p and to normal with probability 1 — p. As we vary 
the value of p it will trace out an ROC curve given by a diagonal straight line, as 
shown in Figure 5.11. Any classifier below the diagonal line performs worse than 
random guessing. 

Sometimes it is useful to have a single number that characterises the whole ROC 
curve. One approach is to measure the area under the curve (AUC). A value of 0.5 
for the AUC represents random guessing whereas a value of 1.0 represents a perfect 
classifier. 
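
An ROC curve and its AUC can be computed by sweeping a threshold over the classifier's scores. The following sketch is one simple way to do this in plain Python, assuming scores are estimated probabilities of the positive class and labels are 1 for cancer and 0 for normal; the data and function names are hypothetical.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by lowering a threshold over scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Visit examples from the highest score down.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example in a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
area = auc(fpr, tpr)
```

For this toy data the area is 8/9, which matches the fraction of (positive, negative) pairs in which the positive example receives the higher score: eight of the nine pairs.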

Another measure is the F-score, which is the harmonic mean of precision and
recall, and is therefore defined by

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{5.38}$$

$$F = \frac{2N_{\mathrm{TP}}}{2N_{\mathrm{TP}} + N_{\mathrm{FP}} + N_{\mathrm{FN}}}. \tag{5.39}$$

Figure 5.11 The receiver operating characteristic (ROC) curve is a plot of true positive
rate against false positive rate, and it characterizes the trade-off between type 1 and
type 2 errors in a classification problem. The upper blue curve represents a better
classifier than the lower red curve. Here the dashed curve represents the performance of
a simple random classifier.


Of course, we can also combine the confusion matrix in Figure 5.9 with the loss ma- 
trix in Figure 5.6 to compute the expected loss by multiplying the elements pointwise 
and summing the resulting products. 
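
Both the count form (5.39) of the F-score and the pointwise combination of confusion and loss matrices are short computations. The sketch below uses the row and column layout of Figures 5.6 and 5.9 (rows for the true class, columns for the assigned class, normal before cancer); the counts are hypothetical.

```python
def f_score(n_tp, n_fp, n_fn):
    """F-score written in terms of counts, as in (5.39)."""
    return 2 * n_tp / (2 * n_tp + n_fp + n_fn)


def total_loss(confusion, loss):
    """Multiply the two matrices elementwise and sum the products."""
    return sum(c * l
               for c_row, l_row in zip(confusion, loss)
               for c, l in zip(c_row, l_row))


confusion = [[98901, 999],   # true normal: N_TN, N_FP (hypothetical counts)
             [10, 90]]       # true cancer: N_FN, N_TP
loss = [[0, 1],
        [100, 0]]            # loss matrix of Figure 5.6

n_total = sum(sum(row) for row in confusion)
loss_sum = total_loss(confusion, loss)       # 999 * 1 + 10 * 100
average_loss = loss_sum / n_total
f = f_score(n_tp=confusion[1][1], n_fp=confusion[0][1], n_fn=confusion[1][0])
```

Dividing the summed loss by the total count gives an empirical estimate of the expected loss per person, here 1999 / 100000.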

Although the ROC curve can be extended to more than two classes, it rapidly 
becomes cumbersome as the number of classes increases. 


5.3 Generative Classifiers


We turn next to a probabilistic view of classification and show how models with 
linear decision boundaries arise from simple assumptions about the distribution of 
the data. We have already discussed the distinction between the discriminative and 
the generative approaches to classification. Here we will adopt a generative approach 
in which we model the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$ as well as the class priors
$p(\mathcal{C}_k)$ and then use these to compute posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ through Bayes’
theorem. 

Figure 5.12 Plot of the logistic sigmoid function $\sigma(a)$ defined by (5.42), shown in red, together
with the scaled probit function $\Phi(\lambda a)$, for $\lambda^2 = \pi/8$, shown in dashed blue, where $\Phi(a)$
is defined by (5.86). The scaling factor $\pi/8$ is chosen so that the derivatives of
the two curves are equal for $a = 0$.

First, consider problems having two classes. The posterior probability for class
$\mathcal{C}_1$ can be written as

$$p(\mathcal{C}_1|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_1)\, p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1)\, p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)\, p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a) \tag{5.40}$$

where we have defined

$$a = \ln \frac{p(\mathbf{x}|\mathcal{C}_1)\, p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)\, p(\mathcal{C}_2)} \tag{5.41}$$

and $\sigma(a)$ is the logistic sigmoid function defined by

$$\sigma(a) = \frac{1}{1 + \exp(-a)}, \tag{5.42}$$

which is plotted in Figure 5.12. The term ‘sigmoid’ means S-shaped. This type of 
function is sometimes also called a ‘squashing function’ because it maps the whole 
real axis into a finite interval. The logistic sigmoid has been encountered already 
in earlier chapters and plays an important role in many classification algorithms. It 
satisfies the following symmetry property: 


$$\sigma(-a) = 1 - \sigma(a) \tag{5.43}$$


as is easily verified. The inverse of the logistic sigmoid is given by 


$$a = \ln\left(\frac{\sigma}{1 - \sigma}\right) \tag{5.44}$$


and is known as the logit function. It represents the log of the ratio of probabilities
$\ln\left[p(\mathcal{C}_1|\mathbf{x}) / p(\mathcal{C}_2|\mathbf{x})\right]$ for the two classes, also known as the log odds.
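
The logistic sigmoid (5.42), its symmetry property (5.43), and the logit inverse (5.44) can be checked numerically. The sketch below is one possible implementation; the branch on the sign of `a` is a numerical-stability choice made here and is not part of the definition, and the joint probability values are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, arranged so exp() never receives a large positive argument."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def logit(sigma):
    """Inverse of the logistic sigmoid, defined for 0 < sigma < 1."""
    return math.log(sigma / (1.0 - sigma))


# Hypothetical joint values p(x|C1)p(C1) and p(x|C2)p(C2) at some input x.
joint1, joint2 = 0.03, 0.01
a = math.log(joint1 / joint2)                    # log odds, as in (5.41)
posterior_via_sigmoid = sigmoid(a)
posterior_direct = joint1 / (joint1 + joint2)    # normalization, as in (5.40)
```

Both routes give a posterior of 0.75 for class 1, which illustrates that (5.40) is simply a rewriting of Bayes' theorem in terms of the log odds.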

Note that in (5.40), we have simply rewritten the posterior probabilities in an
equivalent form, and so the appearance of the logistic sigmoid may seem artificial.
However, it will have significance provided a(x) has a constrained functional form.
We will shortly consider situations in which a(x) is a linear function of x, in which
case the posterior probability is governed by a generalized linear model.

If there are K > 2 classes, we have

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(\mathbf{x}|\mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \tag{5.45}$$

which is known as the normalized exponential and can be regarded as a multi-class
generalization of the logistic sigmoid. Here the quantities $a_k$ are defined by

$$a_k = \ln\left(p(\mathbf{x}|\mathcal{C}_k)\, p(\mathcal{C}_k)\right). \tag{5.46}$$

The normalized exponential is also known as the softmax function, as it represents
a smoothed version of the ‘max’ function because, if $a_k \gg a_j$ for all $j \neq k$, then
$p(\mathcal{C}_k|\mathbf{x}) \simeq 1$, and $p(\mathcal{C}_j|\mathbf{x}) \simeq 0$.
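
A small Python sketch of the normalized exponential (5.45) is given below. Subtracting the largest activation before exponentiating is a common numerical safeguard assumed here; it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math


def softmax(activations):
    """Normalized exponential of a list of activations a_k."""
    a_max = max(activations)
    exps = [math.exp(a - a_max) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


dominant = softmax([10.0, 0.0, -5.0])   # first activation much larger
two_class = softmax([2.0, 0.5])         # reduces to a sigmoid of a_1 - a_2
```

When one activation dominates, its output approaches one and the others approach zero, which is the sense in which softmax is a smoothed ‘max’. For two classes the first output equals the logistic sigmoid evaluated at the difference of the two activations.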

We now investigate the consequences of choosing specific forms for the class- 
conditional densities, looking first at continuous input variables x and then dis- 
cussing briefly discrete inputs. 


5.3.1 Continuous inputs 


Let us assume that the class-conditional densities are Gaussian. We will then ex- 
plore the resulting form for the posterior probabilities. To start with, we will assume 
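that all classes share the same covariance matrix $\boldsymbol{\Sigma}$. Thus, the density for class $\mathcal{C}_k$ is
given by

$$p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right\}. \tag{5.47}$$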


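First, suppose that we have two classes. From (5.40) and (5.41), we have

$$p(\mathcal{C}_1|\mathbf{x}) = \sigma(\mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0) \tag{5.48}$$

where we have defined

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \tag{5.49}$$

$$w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}. \tag{5.50}$$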


We see that the quadratic terms in x from the exponents of the Gaussian densities
have cancelled (due to the assumption of common covariance matrices), leading to
a linear function of x in the argument of the logistic sigmoid. This result is
illustrated for a two-dimensional input space x in Figure 5.13. The resulting decision
boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$
are constant and so will be given by linear functions of x, and therefore the decision
boundaries are linear in input space. The prior probabilities $p(\mathcal{C}_k)$ enter only through
the bias parameter $w_0$, so that changes in the priors have the effect of making
parallel shifts of the decision boundary and more generally of the parallel contours of
constant posterior probability.

Figure 5.13 The left-hand plot shows the class-conditional densities for two classes, denoted red and blue.
On the right is the corresponding posterior probability $p(\mathcal{C}_1|\mathbf{x})$, which is given by a logistic sigmoid of a linear
function of x. The surface in the right-hand plot is coloured using a proportion of red ink given by $p(\mathcal{C}_1|\mathbf{x})$ and a
proportion of blue ink given by $p(\mathcal{C}_2|\mathbf{x}) = 1 - p(\mathcal{C}_1|\mathbf{x})$.
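
The claim that shared-covariance Gaussians give a logistic sigmoid of a linear function can be verified numerically. The sketch below, restricted to two input dimensions so that the matrix inverse can be written by hand, computes w and w0 from (5.49) and (5.50) and compares the resulting posterior with a direct application of Bayes' theorem. All parameter values and helper names are hypothetical.

```python
import math


def inv2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_discriminant(mu1, mu2, cov, prior1):
    """Return (w, w0) from (5.49) and (5.50) for a shared 2x2 covariance."""
    cov_inv = inv2x2(cov)
    w = matvec(cov_inv, [m1 - m2 for m1, m2 in zip(mu1, mu2)])
    w0 = (-0.5 * dot(mu1, matvec(cov_inv, mu1))
          + 0.5 * dot(mu2, matvec(cov_inv, mu2))
          + math.log(prior1 / (1.0 - prior1)))
    return w, w0


def gaussian_density(x, mu, cov):
    """Gaussian density (5.47) for D = 2, where (2 pi)^(D/2) = 2 pi."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    diff = [xi - mi for xi, mi in zip(x, mu)]
    quad = dot(diff, matvec(inv2x2(cov), diff))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


mu1, mu2 = [1.0, 1.0], [-1.0, 0.0]
cov = [[1.0, 0.3], [0.3, 2.0]]
prior1 = 0.3
x = [0.5, -0.2]

w, w0 = linear_discriminant(mu1, mu2, cov, prior1)
posterior_linear = 1.0 / (1.0 + math.exp(-(dot(w, x) + w0)))

joint1 = gaussian_density(x, mu1, cov) * prior1
joint2 = gaussian_density(x, mu2, cov) * (1.0 - prior1)
posterior_bayes = joint1 / (joint1 + joint2)
```

The two posteriors agree to rounding error. Changing `prior1` alters only `w0`, which corresponds to the parallel shift of the decision boundary described in the text.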

For the general case of K classes, the posterior probabilities are given by (5.45) 
where, from (5.46) and (5.47), we have 


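$$a_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0} \tag{5.51}$$

in which we have defined

$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k \tag{5.52}$$

$$w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln p(\mathcal{C}_k). \tag{5.53}$$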

We see that the $a_k(\mathbf{x})$ are again linear functions of x as a consequence of the
cancellation of the quadratic terms due to the shared covariances. The resulting decision
boundaries, corresponding to the minimum misclassification rate, will occur when 
two of the posterior probabilities (the two largest) are equal, and so will be defined 
by linear functions of x. Thus, we again have a generalized linear model. 

If we relax the assumption of a shared covariance matrix and allow each
class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_k$, then the earlier
cancellations will no longer occur, and we will obtain quadratic functions of x,
giving rise to a quadratic discriminant. The linear and quadratic decision boundaries
are illustrated in Figure 5.14. 


5.3.2 Maximum likelihood solution 


Once we have specified a parametric functional form for the class-conditional
densities $p(\mathbf{x}|\mathcal{C}_k)$, we can then determine the values of the parameters, together with
the prior class probabilities $p(\mathcal{C}_k)$, using maximum likelihood. This requires a data
set comprising observations of x along with their corresponding class labels.

Figure 5.14 The left-hand plot shows the class-conditional densities for three classes each having a Gaussian
distribution, coloured red, green, and blue, in which the red and blue classes have the same covariance matrix.
The right-hand plot shows the corresponding posterior probabilities, in which each point on the image is coloured
using proportions of red, blue, and green ink corresponding to the posterior probabilities for the respective
three classes. The decision boundaries are also shown. Notice that the boundary between the red and blue
classes, which have the same covariance matrix, is linear, whereas those between the other pairs of classes are
quadratic.

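First, suppose we have two classes, each having a Gaussian class-conditional
density with a shared covariance matrix, and suppose we have a data set $\{\mathbf{x}_n, t_n\}$
where $n = 1, \ldots, N$. Here $t_n = 1$ denotes class $\mathcal{C}_1$ and $t_n = 0$ denotes class $\mathcal{C}_2$. We
denote the prior class probability $p(\mathcal{C}_1) = \pi$, so that $p(\mathcal{C}_2) = 1 - \pi$. For a data point
$\mathbf{x}_n$ from class $\mathcal{C}_1$, we have $t_n = 1$ and hence

$$p(\mathbf{x}_n, \mathcal{C}_1) = p(\mathcal{C}_1)\, p(\mathbf{x}_n|\mathcal{C}_1) = \pi\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}).$$

Similarly for class $\mathcal{C}_2$, we have $t_n = 0$ and hence

$$p(\mathbf{x}_n, \mathcal{C}_2) = p(\mathcal{C}_2)\, p(\mathbf{x}_n|\mathcal{C}_2) = (1 - \pi)\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma}).$$

Thus, the likelihood function is given by

$$p(\mathbf{t}, \mathbf{X}|\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \left[\pi\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{t_n} \left[(1 - \pi)\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1 - t_n} \tag{5.54}$$

where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. As usual, it is convenient to maximize the log of the
likelihood function. Consider first the maximization with respect to $\pi$. The terms in
the log likelihood function that depend on $\pi$ are

$$\sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}. \tag{5.55}$$

Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain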


$$\pi = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} \tag{5.56}$$


where $N_1$ denotes the total number of data points in class $\mathcal{C}_1$, and $N_2$ denotes the
total number of data points in class $\mathcal{C}_2$. Thus, the maximum likelihood estimate
for $\pi$ is simply the fraction of points in class $\mathcal{C}_1$ as expected. This result is easily
generalized to the multi-class case where again the maximum likelihood estimate of
the prior probability associated with class $\mathcal{C}_k$ is given by the fraction of the training
set points assigned to that class.
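
For readers who wish to check the step from (5.55) to (5.56), the stationarity condition can be written out explicitly. The lines below are a sketch that uses only the facts that the labels $t_n$ sum to $N_1$ and the quantities $1 - t_n$ sum to $N_2$.

```latex
\begin{align*}
\frac{\partial}{\partial \pi} \sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\}
  &= \frac{N_1}{\pi} - \frac{N_2}{1 - \pi} = 0 \\
\Rightarrow\quad N_1 (1 - \pi) &= N_2\, \pi \\
\Rightarrow\quad \pi &= \frac{N_1}{N_1 + N_2} = \frac{N_1}{N}.
\end{align*}
```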

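Now consider the maximization with respect to $\boldsymbol{\mu}_1$. Again, we can pick out of
the log likelihood function those terms that depend on $\boldsymbol{\mu}_1$:

$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = -\frac{1}{2}\sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) + \text{const}. \tag{5.57}$$

Setting the derivative with respect to $\boldsymbol{\mu}_1$ to zero and rearranging, we obtain

$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n \mathbf{x}_n, \tag{5.58}$$

which is simply the mean of all the input vectors $\mathbf{x}_n$ assigned to class $\mathcal{C}_1$. By a
similar argument, the corresponding result for $\boldsymbol{\mu}_2$ is given by

$$\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1 - t_n)\, \mathbf{x}_n, \tag{5.59}$$

which again is the mean of all the input vectors $\mathbf{x}_n$ assigned to class $\mathcal{C}_2$.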

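Finally, consider the maximum likelihood solution for the shared covariance
matrix $\boldsymbol{\Sigma}$. Picking out the terms in the log likelihood function that depend on $\boldsymbol{\Sigma}$, we
have

$$\begin{aligned}
&-\frac{1}{2}\sum_{n=1}^{N} t_n \ln |\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) \\
&-\frac{1}{2}\sum_{n=1}^{N} (1 - t_n) \ln |\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N} (1 - t_n) (\mathbf{x}_n - \boldsymbol{\mu}_2)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_2) \\
&= -\frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{N}{2} \mathrm{Tr}\left\{\boldsymbol{\Sigma}^{-1}\mathbf{S}\right\}
\end{aligned} \tag{5.60}$$

where we have defined

$$\mathbf{S} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2 \tag{5.61}$$

$$\mathbf{S}_1 = \frac{1}{N_1}\sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \boldsymbol{\mu}_1)(\mathbf{x}_n - \boldsymbol{\mu}_1)^{\mathrm{T}} \tag{5.62}$$

$$\mathbf{S}_2 = \frac{1}{N_2}\sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \boldsymbol{\mu}_2)(\mathbf{x}_n - \boldsymbol{\mu}_2)^{\mathrm{T}}. \tag{5.63}$$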

Using the standard result for the maximum likelihood solution for a Gaussian
distribution, we see that $\boldsymbol{\Sigma} = \mathbf{S}$, which represents a weighted average of the covariance
matrices associated with each of the two classes separately.

This result is easily extended to the K-class problem to obtain the corresponding
maximum likelihood solutions for the parameters in which each class-conditional 
density is Gaussian with a shared covariance matrix. Note that the approach of fitting 
Gaussian distributions to the classes is not robust to outliers, because the maximum 
likelihood estimation of a Gaussian is not robust. 
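
The maximum likelihood estimates (5.56), (5.58), (5.59), and the shared covariance S of (5.61) amount to counting, averaging, and pooling outer products. The plain-Python sketch below illustrates this for a tiny hypothetical data set; the function name and the list-of-lists representation are assumptions made here.

```python
def fit_shared_covariance_gaussians(xs, ts):
    """Return (prior1, mu1, mu2, cov) for two classes with a shared covariance.

    xs: list of D-dimensional input vectors (lists of floats).
    ts: list of labels, 1 for class C1 and 0 for class C2.
    """
    n = len(xs)
    dim = len(xs[0])
    n1 = sum(ts)
    n2 = n - n1
    prior1 = n1 / n                                   # (5.56)

    mu1 = [sum(t * x[d] for x, t in zip(xs, ts)) / n1 for d in range(dim)]
    mu2 = [sum((1 - t) * x[d] for x, t in zip(xs, ts)) / n2 for d in range(dim)]

    # S = (N1/N) S1 + (N2/N) S2 is the average over all points of the outer
    # product of each point's deviation from its own class mean.
    cov = [[0.0] * dim for _ in range(dim)]
    for x, t in zip(xs, ts):
        mu = mu1 if t == 1 else mu2
        diff = [x[d] - mu[d] for d in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += diff[i] * diff[j] / n
    return prior1, mu1, mu2, cov


xs = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0], [-1.0, 0.0], [-3.0, 2.0]]
ts = [1, 1, 1, 0, 0]
prior1, mu1, mu2, cov = fit_shared_covariance_gaussians(xs, ts)
```

For this data the estimates are a prior of 0.6, class means (2, 3) and (−2, 1), and a pooled covariance with entries 0.8, −0.4, −0.4, and 1.6. Because every point contributes with equal weight, a single distant outlier would shift both the mean and the covariance, which is the lack of robustness noted in the text.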


5.3.3 Discrete features 


Let us now consider discrete feature values $x_i$. For simplicity, we begin by looking
at binary feature values $x_i \in \{0, 1\}$ and discuss the extension to more general
discrete features shortly. If there are $D$ inputs, then a general distribution would
correspond to a table of $2^D$ numbers for each class and have $2^D - 1$ independent
variables (due to the summation constraint). Because this grows exponentially with
the number of features, we can seek a more restricted representation. Here we will
make the naive Bayes assumption in which the feature values are treated as independent
when conditioned on the class $\mathcal{C}_k$. Thus, we have class-conditional distributions
of the form

$$p(\mathbf{x} | \mathcal{C}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}, \tag{5.64}$$


which contain $D$ independent parameters for each class. Substituting into (5.46) then
gives

$$a_k(\mathbf{x}) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln (1 - \mu_{ki}) \right\} + \ln p(\mathcal{C}_k), \tag{5.65}$$


which again are linear functions of the input values $x_i$. For $K = 2$ classes, we can
alternatively consider the logistic sigmoid formulation given by (5.40). Analogous
results are obtained for discrete variables that take $L > 2$ states.
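
The linearity of (5.65) in the inputs can be made explicit by collecting the terms that multiply each $x_i$, which gives weights $\ln\{\mu_{ki}/(1-\mu_{ki})\}$ and a bias $\sum_i \ln(1-\mu_{ki}) + \ln p(\mathcal{C}_k)$. The sketch below checks this rearrangement over all binary inputs; the parameter values and the helper names `a_k` and `linear_form` are assumptions made purely for illustration.

```python
import itertools
import math


def a_k(x, mu_k, prior_k):
    """Evaluate (5.65) directly for a binary feature vector x."""
    total = math.log(prior_k)
    for x_i, mu_ki in zip(x, mu_k):
        total += x_i * math.log(mu_ki) + (1 - x_i) * math.log(1 - mu_ki)
    return total


def linear_form(mu_k, prior_k):
    """Rewrite (5.65) as w^T x + w0 by collecting the terms that multiply x_i."""
    w = [math.log(mu_ki / (1 - mu_ki)) for mu_ki in mu_k]
    w0 = math.log(prior_k) + sum(math.log(1 - mu_ki) for mu_ki in mu_k)
    return w, w0


# Hypothetical parameters for one class with D = 3 binary features.
mu_k = [0.2, 0.7, 0.5]
prior_k = 0.4
w, w0 = linear_form(mu_k, prior_k)

# Largest discrepancy between the two forms over all 2^D binary inputs.
max_gap = max(
    abs(a_k(x, mu_k, prior_k) - (sum(wi * xi for wi, xi in zip(w, x)) + w0))
    for x in itertools.product([0, 1], repeat=len(mu_k))
)
print(max_gap)
```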


5.3.4 Exponential family 


As we have seen, for both Gaussian distributed and discrete inputs, the posterior
class probabilities are given by generalized linear models with logistic sigmoid ($K = 2$
classes) or softmax ($K > 2$ classes) activation functions. These are particular cases
of a more general result obtained by assuming that the class-conditional densities
$p(\mathbf{x}|\mathcal{C}_k)$ are members of the subset of the exponential family of distributions given
by

$$p(\mathbf{x} | \boldsymbol{\lambda}_k, s) = \frac{1}{s} h\left( \frac{1}{s} \mathbf{x} \right) g(\boldsymbol{\lambda}_k) \exp\left\{ \frac{1}{s} \boldsymbol{\lambda}_k^{\mathrm{T}} \mathbf{x} \right\}. \tag{5.66}$$


Here the scaling parameter s is shared across all the classes. 

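For the two-class problem, we substitute this expression for the class-conditional
densities into (5.41) and we see that the posterior class probability is again given by
a logistic sigmoid acting on a linear function $a(\mathbf{x})$, which is given by

$$a(\mathbf{x}) = (\boldsymbol{\lambda}_1 - \boldsymbol{\lambda}_2)^{\mathrm{T}} \mathbf{x} + \ln g(\boldsymbol{\lambda}_1) - \ln g(\boldsymbol{\lambda}_2) + \ln p(\mathcal{C}_1) - \ln p(\mathcal{C}_2). \tag{5.67}$$

Similarly, for the $K$-class problem, we substitute the class-conditional density
expression into (5.46) to give

$$a_k(\mathbf{x}) = \boldsymbol{\lambda}_k^{\mathrm{T}} \mathbf{x} + \ln g(\boldsymbol{\lambda}_k) + \ln p(\mathcal{C}_k) \tag{5.68}$$

and so again is a linear function of $\mathbf{x}$.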


5.4. Discriminative Classifiers


For the two-class classification problem, we have seen that the posterior probability
of class $\mathcal{C}_1$ can be written as a logistic sigmoid acting on a linear function of
$\mathbf{x}$, for a wide choice of class-conditional distributions $p(\mathbf{x}|\mathcal{C}_k)$ from the exponential
family. Similarly, for the multi-class case, the posterior probability of class $\mathcal{C}_k$ is
given by a softmax transformation of linear functions of $\mathbf{x}$. For specific choices of
the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, we have used maximum likelihood to determine
the parameters of the densities as well as the class priors $p(\mathcal{C}_k)$ and then used
Bayes’ theorem to find the posterior class probabilities. This represents an example
of generative modelling, because we could take such a model and generate synthetic
data by drawing values of $\mathbf{x}$ from the marginal distribution $p(\mathbf{x})$ or from any of the
class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$.

However, an alternative approach is to use the functional form of the generalized
linear model explicitly and to determine its parameters directly by using maximum
likelihood. In this direct approach, we maximize a likelihood function defined
through the conditional distribution $p(\mathcal{C}_k|\mathbf{x})$, which represents a form of discriminative
probabilistic modelling. One advantage of the discriminative approach is that
there will typically be fewer learnable parameters to be determined, as we will see
shortly. It may also lead to improved predictive performance, particularly when the
assumed forms for the class-conditional densities represent a poor approximation to
the true distributions.




5.4.1 Activation functions 


In linear regression, the model prediction $y(\mathbf{x}, \mathbf{w})$ is given by a linear function
of the parameters

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0, \tag{5.69}$$

which gives a continuous-valued output in the range $(-\infty, \infty)$. For classification
problems, however, we wish to predict discrete class labels, or more generally posterior
probabilities that lie in the range $(0, 1)$. To achieve this, we consider a generalization
of this model in which we transform the linear function of $\mathbf{w}$ and $w_0$ using
a nonlinear function $f(\cdot)$ so that

$$y(\mathbf{x}, \mathbf{w}) = f\left( \mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0 \right). \tag{5.70}$$


In the machine learning literature, $f(\cdot)$ is known as an activation function, whereas
its inverse is called a link function in the statistics literature. The decision surfaces
correspond to $y(\mathbf{x}) = \text{constant}$, so that $\mathbf{w}^{\mathrm{T}} \mathbf{x} = \text{constant}$, and hence the decision
surfaces are linear functions of $\mathbf{x}$, even if the function $f(\cdot)$ is nonlinear. For this
reason, the class of models described by (5.70) are called generalized linear models
(McCullagh and Nelder, 1989). However, in contrast to the models used for regression,
they are no longer linear in the parameters due to the nonlinear function $f(\cdot)$.
This will lead to more complex analytical and computational properties than for
linear regression models. Nevertheless, these models are still relatively simple compared
to the much more flexible nonlinear models that will be studied in subsequent
chapters.


5.4.2 Fixed basis functions 


So far in this chapter, we have considered classification models that work directly
with the original input vector $\mathbf{x}$. However, all the algorithms are equally applicable
if we first make a fixed nonlinear transformation of the inputs using a vector
of basis functions $\boldsymbol{\phi}(\mathbf{x})$. The resulting decision boundaries will be linear in the feature
space $\boldsymbol{\phi}$, and these correspond to nonlinear decision boundaries in the original $\mathbf{x}$
space, as illustrated in Figure 5.15. Classes that are linearly separable in the feature
space $\boldsymbol{\phi}(\mathbf{x})$ need not be linearly separable in the original observation space $\mathbf{x}$.

Note that as in our discussion of linear models for regression, one of the basis
functions is typically set to a constant, say $\phi_0(\mathbf{x}) = 1$, so that the corresponding
parameter $w_0$ plays the role of a bias.

For many problems of practical interest, there is significant overlap in $\mathbf{x}$-space
between the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$. This corresponds to posterior
probabilities $p(\mathcal{C}_k|\mathbf{x})$, which, for at least some values of $\mathbf{x}$, are not 0 or 1. In such cases,
the optimal solution is obtained by modelling the posterior probabilities accurately
and then applying standard decision theory. Note that nonlinear transformations
$\boldsymbol{\phi}(\mathbf{x})$ cannot remove such a class overlap, although they can increase the level of
overlap or create an overlap where none existed in the original observation space.
However, suitable choices of nonlinearity can make the process of modelling the
posterior probabilities easier. Nevertheless, such fixed basis function models have important
limitations, and these will be resolved in later chapters by allowing the basis
functions themselves to adapt to the data.

Figure 5.15 Illustration of the role of nonlinear basis functions in linear classification models. The left-hand
plot shows the original input space $(x_1, x_2)$ together with data points from two classes labelled red and blue.
Two ‘Gaussian’ basis functions $\phi_1(\mathbf{x})$ and $\phi_2(\mathbf{x})$ are defined in this space with centres shown by the green
crosses and with contours shown by the green circles. The right-hand plot shows the corresponding feature
space $(\phi_1, \phi_2)$ together with the linear decision boundary given by a logistic regression model of the
form discussed in Section 5.4.3. This corresponds to a nonlinear decision boundary in the original input space,
shown by the black curve in the left-hand plot.
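
The effect of a fixed nonlinear transformation can be seen on a very small example. The sketch below uses an XOR-style arrangement of four points, for which no straight line in the input space separates the classes, and maps them through two Gaussian basis functions. The centres, the width `s`, and the threshold are assumptions chosen for illustration; they are not the values used to produce Figure 5.15.

```python
import math


def gaussian_basis(x, centre, s):
    """Gaussian basis function exp(-||x - centre||^2 / (2 s^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-sq_dist / (2.0 * s ** 2))


# XOR-style data: no straight line in (x1, x2) separates the two classes.
class_a = [(0.0, 0.0), (1.0, 1.0)]
class_b = [(0.0, 1.0), (1.0, 0.0)]

# Assumed centres and width for the two basis functions.
centres = [(0.0, 0.0), (1.0, 1.0)]
s = 0.5


def features(x):
    """Map an input point to the feature vector (phi_1, phi_2)."""
    return [gaussian_basis(x, c, s) for c in centres]


# In feature space the linear rule phi_1 + phi_2 > threshold separates the classes.
threshold = 0.6
scores_a = [sum(features(x)) for x in class_a]
scores_b = [sum(features(x)) for x in class_b]
print(scores_a, scores_b)
```

The boundary $\phi_1 + \phi_2 = 0.6$ is a straight line in the feature space but a closed curve around the two centres in the original input space.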


5.4.3 Logistic regression 


We first consider the problem of two-class classification. In our discussion of
generative approaches in Section 5.3, we saw that under rather general assumptions,
the posterior probability of class $\mathcal{C}_1$ can be written as a logistic sigmoid acting on a
linear function of the feature vector $\boldsymbol{\phi}$ so that

$$p(\mathcal{C}_1 | \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma\left( \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi} \right) \tag{5.71}$$

with $p(\mathcal{C}_2|\boldsymbol{\phi}) = 1 - p(\mathcal{C}_1|\boldsymbol{\phi})$. Here $\sigma(\cdot)$ is the logistic sigmoid function defined by
(5.42). In the terminology of statistics, this model is known as logistic regression,
although it should be emphasized that this is a model for classification rather than
for predicting a continuous variable.

For an $M$-dimensional feature space $\boldsymbol{\phi}$, this model has $M$ adjustable parameters.
By contrast, if we had fitted Gaussian class-conditional densities using maximum
likelihood, we would have used $2M$ parameters for the means and $M(M + 1)/2$
parameters for the (shared) covariance matrix. Together with the class prior $p(\mathcal{C}_1)$,
this gives a total of $M(M + 5)/2 + 1$ parameters, which grows quadratically with $M$,
in contrast to the linear dependence on $M$ of the number of parameters in logistic
regression. For large values of $M$, there is a clear advantage in working with the
logistic regression model directly.
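
The two parameter counts can be tabulated to show how quickly they diverge. This is a small sketch of the arithmetic above; the function names are illustrative and not part of the text.

```python
def logistic_regression_params(M):
    """Number of adjustable parameters in two-class logistic regression."""
    return M


def gaussian_generative_params(M):
    """Parameters of the two-class Gaussian model with a shared covariance."""
    means = 2 * M                        # one M-dimensional mean per class
    shared_covariance = M * (M + 1) // 2  # symmetric M x M matrix
    prior = 1                            # the class prior p(C1)
    return means + shared_covariance + prior


for M in (2, 10, 100, 1000):
    print(M, logistic_regression_params(M), gaussian_generative_params(M))
```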


We now use maximum likelihood to determine the parameters of the logistic
regression model. To do this, we will make use of the derivative of the logistic sigmoid
function, which can conveniently be expressed in terms of the sigmoid function
itself:

$$\frac{d\sigma}{da} = \sigma (1 - \sigma). \tag{5.72}$$

For a data set $\{\boldsymbol{\phi}_n, t_n\}$, where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$ and $t_n \in \{0, 1\}$, with $n = 1, \ldots, N$,
the likelihood function can be written

$$p(\mathbf{t} | \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n} \tag{5.73}$$


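where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ and $y_n = p(\mathcal{C}_1|\boldsymbol{\phi}_n)$. As usual, we can define an error
function by taking the negative logarithm of the likelihood, which gives the cross-entropy
error function:

$$E(\mathbf{w}) = -\ln p(\mathbf{t} | \mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\} \tag{5.74}$$

where $y_n = \sigma(a_n)$ and $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$. Taking the gradient of the error function with
respect to $\mathbf{w}$, we obtain

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n \tag{5.75}$$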


where we have made use of (5.72). We see that the factor involving the derivative
of the logistic sigmoid has cancelled, leading to a simplified form for the gradient
of the log likelihood. In particular, the contribution to the gradient from data point
$n$ is given by the ‘error’ $y_n - t_n$ between the target value and the prediction of
the model times the basis function vector $\boldsymbol{\phi}_n$. Furthermore, comparison with (4.12)
shows that this takes precisely the same form as the gradient of the sum-of-squares
error function for the linear regression model.
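
The gradient (5.75) can be verified against a finite-difference approximation of the error (5.74). The following is a minimal sketch, assuming a tiny made-up data set in which the first feature is the constant bias feature; the data, the weight vector, and the function names are illustrative assumptions.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict(w, phi):
    """Model output y = sigma(w^T phi), as in (5.71)."""
    return sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))


def cross_entropy(w, Phi, t):
    """Cross-entropy error function (5.74)."""
    return -sum(
        tn * math.log(predict(w, phi)) + (1 - tn) * math.log(1.0 - predict(w, phi))
        for phi, tn in zip(Phi, t)
    )


def gradient(w, Phi, t):
    """Analytic gradient (5.75): sum over n of (y_n - t_n) phi_n."""
    grad = [0.0] * len(w)
    for phi, tn in zip(Phi, t):
        err = predict(w, phi) - tn
        for i, phi_i in enumerate(phi):
            grad[i] += err * phi_i
    return grad


def numerical_gradient(w, Phi, t, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((cross_entropy(w_plus, Phi, t) - cross_entropy(w_minus, Phi, t)) / (2 * eps))
    return grad


# Illustrative feature vectors (first component is the bias feature) and targets.
Phi = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.2], [1.0, -1.0, -0.5]]
t = [1, 0, 1, 0]
w = [0.1, -0.2, 0.3]

analytic = gradient(w, Phi, t)
numeric = numerical_gradient(w, Phi, t)
max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, max_diff)
```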

The maximum likelihood solution corresponds to $\nabla E(\mathbf{w}) = 0$. However, from
(5.75) we see that this no longer corresponds to a set of linear equations, due to
the nonlinearity in $y(\cdot)$, and so this equation does not have a closed-form solution.
One approach to finding a maximum likelihood solution would be to use stochastic
gradient descent, in which $\nabla E_n$ is the $n$th term on the right-hand side of (5.75).
Stochastic gradient descent will be the principal approach to training the highly nonlinear
neural networks discussed in later chapters. However, the maximum likelihood
equation is only ‘slightly’ nonlinear, and in fact the error function (5.74), in which the
model is defined by (5.71), is a convex function of the parameters, which allows the
error function to be minimized using a simple algorithm called iterative reweighted
least squares or IRLS (Bishop, 2006). This algorithm, however, does not easily generalize to
more complex models such as deep neural networks.

Note that maximum likelihood can exhibit severe over-fitting for data sets that
are linearly separable. This arises because the maximum likelihood solution occurs
when the hyperplane corresponding to $\sigma = 0.5$, equivalent to $\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi} = 0$, separates
the two classes and the magnitude of $\mathbf{w}$ goes to infinity. In this case, the logistic
sigmoid function becomes infinitely steep in feature space, corresponding to a
Heaviside step function, so that every training point from each class $k$ is assigned a
posterior probability $p(\mathcal{C}_k|\mathbf{x}) = 1$. Furthermore, there is typically a continuum of
such solutions because any separating hyperplane will give rise to the same posterior
probabilities at the training data points. Maximum likelihood provides no way to
favour one such solution over another, and which solution is found in practice will
depend on the choice of optimization algorithm and on the parameter initialization.
Note that the problem will arise even if the number of data points is large compared
with the number of parameters in the model, so long as the training data set is linearly
separable. The singularity can be avoided by adding a regularization term to
the error function.
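
The behaviour on linearly separable data can be illustrated by fixing a separating direction and scaling its magnitude. In the sketch below, the cross-entropy error keeps decreasing towards zero as the scale grows, whereas adding a quadratic penalty $\frac{\lambda}{2}\|\mathbf{w}\|^2$ yields a minimum at a finite scale. The data set, the direction, the scales, and the value of `lam` are all assumptions chosen for illustration.

```python
import math


def softplus(z):
    """Numerically stable ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def cross_entropy(w, Phi, t):
    """Error (5.74), using -ln sigma(a) = softplus(-a) and -ln(1 - sigma(a)) = softplus(a)."""
    error = 0.0
    for phi, tn in zip(Phi, t):
        a = sum(wi * pi for wi, pi in zip(w, phi))
        error += softplus(-a) if tn == 1 else softplus(a)
    return error


# One-dimensional separable data with a constant bias feature in first position.
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
t = [0, 0, 1, 1]

direction = [0.0, 1.0]  # unit vector whose decision boundary a = 0 separates the classes
scales = [1.0, 2.0, 4.0, 8.0, 16.0]

# Unregularized error along the ray w = c * direction: decreases without bound on c.
errors = [cross_entropy([c * d for d in direction], Phi, t) for c in scales]

# Adding (lam / 2) * ||w||^2, where ||c * direction||^2 = c^2 for a unit direction.
lam = 0.1
regularized = [e + 0.5 * lam * c ** 2 for e, c in zip(errors, scales)]

print(errors)
print(regularized)
```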


5.4.4 Multi-class logistic regression 


In our discussion of generative models for multi-class classification, we have 
seen that, for a large class of distributions from the exponential family, the posterior 
probabilities are given by a softmax transformation of linear functions of the feature 
variables, so that 


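$$p(\mathcal{C}_k | \boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)} \tag{5.76}$$

where the pre-activations $a_k$ are given by

$$a_k = \mathbf{w}_k^{\mathrm{T}} \boldsymbol{\phi}. \tag{5.77}$$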

There we used maximum likelihood to determine separately the class-conditional
densities and the class priors and then found the corresponding posterior probabilities
using Bayes’ theorem, thereby implicitly determining the parameters $\{\mathbf{w}_k\}$. Here we
consider the use of maximum likelihood to determine the parameters $\{\mathbf{w}_k\}$ of this
model directly. To do this, we will require the derivatives of $y_k$ with respect to all
the pre-activations $a_j$. These are given by

$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j) \tag{5.78}$$

where $I_{kj}$ are the elements of the identity matrix.
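
The softmax derivatives in (5.78) can be checked numerically. The sketch below builds the matrix with elements $y_k(I_{kj} - y_j)$ and compares it with central finite differences; the pre-activation values and the function names are illustrative assumptions.

```python
import math


def softmax(a):
    """Softmax (5.76), shifted by max(a) for numerical stability."""
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(a):
    """Matrix of derivatives dy_k/da_j = y_k (I_kj - y_j), as in (5.78)."""
    y = softmax(a)
    K = len(a)
    return [[y[k] * ((1.0 if k == j else 0.0) - y[j]) for j in range(K)] for k in range(K)]


def numerical_jacobian(a, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    K = len(a)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        a_plus, a_minus = list(a), list(a)
        a_plus[j] += eps
        a_minus[j] -= eps
        y_plus, y_minus = softmax(a_plus), softmax(a_minus)
        for k in range(K):
            J[k][j] = (y_plus[k] - y_minus[k]) / (2 * eps)
    return J


a = [0.5, -1.0, 2.0]  # illustrative pre-activations
J_analytic = softmax_jacobian(a)
J_numeric = numerical_jacobian(a)
max_diff = max(
    abs(J_analytic[k][j] - J_numeric[k][j]) for k in range(len(a)) for j in range(len(a))
)
print(max_diff)
```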

Next we write down the likelihood function. This is most easily done using
the 1-of-$K$ coding scheme in which the target vector $\mathbf{t}_n$ for a feature vector $\boldsymbol{\phi}_n$
belonging to class $\mathcal{C}_k$ is a binary vector with all elements zero except for element $k$,
which equals one. The likelihood function is then given by

$$p(\mathbf{T} | \mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(\mathcal{C}_k | \boldsymbol{\phi}_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}} \tag{5.79}$$

Figure 5.16 Representation of a multi-class linear classification model as a neural network having a single layer
of connections. Each basis function is represented by a node, with the solid node representing the ‘bias’ basis
function $\phi_0$, whereas each output $y_1, \ldots, y_K$ is also represented by a node. The links between the nodes
represent the corresponding weight and bias parameters.


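where $y_{nk} = y_k(\boldsymbol{\phi}_n)$, and $\mathbf{T}$ is an $N \times K$ matrix of target variables with elements
$t_{nk}$. Taking the negative logarithm then gives

$$E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\ln p(\mathbf{T} | \mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}, \tag{5.80}$$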


which is known as the cross-entropy error function for the multi-class classification 
problem. 

We now take the gradient of the error function with respect to one of the parameter
vectors $\mathbf{w}_j$. Making use of the result (5.78) for the derivatives of the softmax
function, we obtain

$$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \boldsymbol{\phi}_n \tag{5.81}$$

where we have made use of $\sum_k t_{nk} = 1$. Again, we could optimize the parameters
through stochastic gradient descent.

Once again, we see the same form arising for the gradient as was found for the
sum-of-squares error function with the linear model and for the cross-entropy error
with the logistic regression model, namely the product of the error $(y_{nj} - t_{nj})$ times
the basis function activation $\boldsymbol{\phi}_n$. These are examples of a more general result that
we will explore later.
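
As a sketch of how the gradient (5.81) is used in practice, the code below applies per-sample stochastic gradient descent updates $\mathbf{w}_j \leftarrow \mathbf{w}_j - \eta\,(y_{nj} - t_{nj})\boldsymbol{\phi}_n$ to a small three-class problem. The data, the learning rate `lr`, the number of epochs, and the fixed visiting order are assumptions made for illustration; a practical implementation would typically shuffle the data and use mini-batches.

```python
import math


def softmax(a):
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]


def outputs(W, phi):
    """Softmax outputs y_k for one feature vector, using a_k = w_k^T phi."""
    return softmax([sum(wi * pi for wi, pi in zip(w_k, phi)) for w_k in W])


def cross_entropy(W, Phi, labels):
    """Error (5.80); with 1-of-K targets only the true-class term survives."""
    return -sum(math.log(outputs(W, phi)[k]) for phi, k in zip(Phi, labels))


def sgd_epoch(W, Phi, labels, lr):
    """One pass over the data, updating with the nth term of (5.81) each time."""
    for phi, k in zip(Phi, labels):
        y = outputs(W, phi)  # computed once, before any weights change
        for j, w_j in enumerate(W):
            err = y[j] - (1.0 if j == k else 0.0)
            for i in range(len(w_j)):
                w_j[i] -= lr * err * phi[i]


# Illustrative data: bias feature plus two inputs, three classes.
Phi = [[1.0, 0.0, 0.0], [1.0, 0.5, 0.2], [1.0, 3.0, 0.0],
       [1.0, 2.5, 0.3], [1.0, 0.0, 3.0], [1.0, 0.2, 2.6]]
labels = [0, 0, 1, 1, 2, 2]

W = [[0.0] * 3 for _ in range(3)]  # one weight vector per class
initial_error = cross_entropy(W, Phi, labels)
for _ in range(200):
    sgd_epoch(W, Phi, labels, lr=0.1)
final_error = cross_entropy(W, Phi, labels)
print(initial_error, final_error)
```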

Linear classification models can be represented as single-layer neural networks
as shown in Figure 5.16. If we consider the derivative of the error function with
respect to a weight $w_{ik}$, which links basis function $\phi_i(\mathbf{x})$ to output unit $y_k$, we have
from (5.81)

$$\frac{\partial E(\mathbf{w}_1, \ldots, \mathbf{w}_K)}{\partial w_{ik}} = \sum_{n=1}^{N} (y_{nk} - t_{nk}) \phi_i(\mathbf{x}_n). \tag{5.82}$$

Comparing this with Figure 5.16, we see that, for each data point $n$, this gradient
takes the form of the product of the output of the basis function at the input end of the
weight link with the ‘error’ $(y_{nk} - t_{nk})$ at the output end.


Figure 5.17 Schematic example of a probability density $p(\theta)$ shown by the blue curve, given in this example
by a mixture of two Gaussians, along with its cumulative distribution function $f(a)$, shown by the red curve. Note
that the value of the blue curve at any point, such as that indicated by the vertical green line, corresponds to
the slope of the red curve at the same point. Conversely, the value of the red curve at this point corresponds to the
area under the blue curve indicated by the shaded green region. In the stochastic threshold model, the class label
takes the value $t = 1$ if the value of $a = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}$ exceeds a threshold, otherwise it takes the value $t = 0$. This is
equivalent to an activation function given by the cumulative distribution function $f(a)$.


5.4.5 Probit regression


We have seen that, for a broad range of class-conditional distributions described 
by the exponential family, the resulting posterior class probabilities are given by a 
logistic (or softmax) transformation acting on a linear function of the feature vari- 
ables. However, not all choices of class-conditional density give rise to such a simple 
form for the posterior probabilities, which suggests that it might be worth exploring 
other types of discriminative probabilistic model. Consider the two-class case, again 
remaining within the framework of generalized linear models, so that 


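$$p(t = 1 | a) = f(a) \tag{5.83}$$

where $a = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}$, and $f(\cdot)$ is the activation function.

One way to motivate an alternative choice for the link function is to consider a
noisy threshold model, as follows. For each input $\boldsymbol{\phi}_n$, we evaluate $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$ and
then we set the target value according to

$$t_n = \begin{cases} 1, & \text{if } a_n \geqslant \theta, \\ 0, & \text{otherwise.} \end{cases} \tag{5.84}$$

If the value of $\theta$ is drawn from a probability density $p(\theta)$, then the corresponding
activation function will be given by the cumulative distribution function

$$f(a) = \int_{-\infty}^{a} p(\theta) \, \mathrm{d}\theta \tag{5.85}$$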


as illustrated in Figure 5.17. 

As a specific example, suppose that the density $p(\theta)$ is given by a zero-mean,
unit-variance Gaussian. The corresponding cumulative distribution function is given
by

$$\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta | 0, 1) \, \mathrm{d}\theta, \tag{5.86}$$

which is known as the probit function. It has a sigmoidal shape and is compared
with the logistic sigmoid function in Figure 5.12. Note that the use of a Gaussian
distribution with general mean and variance does not change the model because this
is equivalent to a re-scaling of the linear coefficients $\mathbf{w}$. Many numerical packages
can evaluate a closely related function defined by

$$\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp(-\theta^2) \, \mathrm{d}\theta \tag{5.87}$$

and known as the erf function or error function (not to be confused with the error
function of a machine learning model). It is related to the probit function by

$$\Phi(a) = \frac{1}{2} \left\{ 1 + \mathrm{erf}\left( \frac{a}{\sqrt{2}} \right) \right\}. \tag{5.88}$$

The generalized linear model based on a probit activation function is known as probit
regression. We can determine the parameters of this model using maximum likelihood
by a straightforward extension of the ideas discussed earlier. In practice, the
results found using probit regression tend to be similar to those of logistic regression.
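
The relation between the probit function and the erf function can be checked with the Python standard library, assuming the standard definition $\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}} \int_0^a \exp(-\theta^2)\,\mathrm{d}\theta$ implemented by `math.erf`. The sketch below compares $\frac{1}{2}\{1 + \mathrm{erf}(a/\sqrt{2})\}$ with a direct trapezoidal integration of the unit Gaussian density; the lower integration limit and the number of steps are illustrative choices.

```python
import math


def probit(a):
    """Probit function via Phi(a) = (1 + erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def probit_by_quadrature(a, lower=-10.0, steps=20000):
    """Trapezoidal integration of the unit Gaussian density from `lower` to a.

    The mass below `lower` = -10 is negligible, so this approximates (5.86).
    """
    def density(theta):
        return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

    h = (a - lower) / steps
    total = 0.5 * (density(lower) + density(a))
    for i in range(1, steps):
        total += density(lower + i * h)
    return total * h


for a in (-2.0, 0.0, 1.0, 1.96):
    print(a, probit(a), probit_by_quadrature(a))
```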

One issue that can occur in practical applications is that of outliers, which can
arise for instance through errors in measuring the input vector $\mathbf{x}$ or through mislabelling
of the target value $t$. Because such points can lie a long way to the wrong side
of the ideal decision boundary, they can seriously distort the classifier. The logistic
and probit regression models behave differently in this respect because the tails of
the logistic sigmoid decay asymptotically like $\exp(-x)$ for $|x| \to \infty$, whereas for
the probit activation function, they decay like $\exp(-x^2)$, and so the probit model
can be significantly more sensitive to outliers.
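
One way to see this difference is to compare the contribution to the negative log likelihood from a single point with target $t = 1$ whose pre-activation $a$ lies far on the wrong side of the boundary. The sketch below is an illustration under that assumption; the function names and the chosen values of $a$ are not from the text. The logistic penalty grows roughly linearly in $|a|$, whereas the probit penalty grows roughly quadratically.

```python
import math


def logistic_penalty(a):
    """-ln sigma(a) = ln(1 + exp(-a)), evaluated in a numerically stable way."""
    return max(-a, 0.0) + math.log1p(math.exp(-abs(a)))


def probit_penalty(a):
    """-ln Phi(a), using Phi(a) = erfc(-a / sqrt(2)) / 2 to keep tail precision."""
    return -math.log(0.5 * math.erfc(-a / math.sqrt(2.0)))


# A point with t = 1 placed increasingly far on the wrong side (negative a).
for a in (-2.0, -4.0, -8.0):
    print(a, logistic_penalty(a), probit_penalty(a))
```

Under maximum likelihood, a larger penalty for a single mislabelled point means a stronger pull on the decision boundary, which is the sense in which the probit model is more sensitive.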


5.4.6 Canonical link functions 


For the linear regression model with a Gaussian noise distribution, the error
function, corresponding to the negative log likelihood, is given by (4.11). If we
take the derivative with respect to the parameter vector $\mathbf{w}$ of the contribution to the
error function from a data point $n$, this takes the form of the ‘error’ $y_n - t_n$ times the
feature vector $\boldsymbol{\phi}_n$, where $y_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$. Similarly, for the combination of the logistic-sigmoid
activation function and the cross-entropy error function (5.74) and for the
softmax activation function with the multi-class cross-entropy error function (5.80),
we again obtain this same simple form. We now show that this is a general result
of assuming a conditional distribution for the target variable from the exponential
family along with a corresponding choice for the activation function known as the
canonical link function.

We again make use of the restricted form (3.169) of exponential family distributions.
Note that here we are applying the assumption of exponential family distribution
to the target variable $t$, in contrast to Section 5.3.4 where we applied it to the
input vector $\mathbf{x}$. We therefore consider conditional distributions of the target variable
of the form

$$p(t | \eta, s) = \frac{1}{s} h\left( \frac{t}{s} \right) g(\eta) \exp\left\{ \frac{\eta t}{s} \right\}. \tag{5.89}$$

Using the same line of argument as led to the derivation of the result (3.172), we see
that the conditional mean of $t$, which we denote by $y$, is given by

$$y \equiv \mathbb{E}[t | \eta] = -s \frac{\mathrm{d}}{\mathrm{d}\eta} \ln g(\eta). \tag{5.90}$$

Thus, $y$ and $\eta$ must be related, and we denote this relation through $\eta = \psi(y)$.

Following Nelder and Wedderburn (1972), we define a generalized linear model 
to be one for which y is a nonlinear function of a linear combination of the input (or 
feature) variables so that 


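$$y = f(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) \tag{5.91}$$

where $f(\cdot)$ is known as the activation function in the machine learning literature, and
$f^{-1}(\cdot)$ is known as the link function in statistics.

Now consider the log likelihood function for this model, which, as a function of
$\eta$, is given by

$$\ln p(\mathbf{t} | \eta, s) = \sum_{n=1}^{N} \ln p(t_n | \eta, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \text{const} \tag{5.92}$$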


where we are assuming that all observations share a common scale parameter (which 
corresponds to the noise variance for a Gaussian distribution, for instance) and so s 
is independent of n. The derivative of the log likelihood with respect to the model 
parameters w is then given by 


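$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} | \eta, s) = \sum_{n=1}^{N} \left\{ \frac{\mathrm{d}}{\mathrm{d}\eta_n} \ln g(\eta_n) + \frac{t_n}{s} \right\} \frac{\mathrm{d}\eta_n}{\mathrm{d}y_n} \frac{\mathrm{d}y_n}{\mathrm{d}a_n} \nabla_{\mathbf{w}} a_n = \sum_{n=1}^{N} \frac{1}{s} \{ t_n - y_n \} \psi'(y_n) f'(a_n) \boldsymbol{\phi}_n \tag{5.93}$$

where $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$, and we have used $y_n = f(a_n)$ together with the result (5.90)
for $\mathbb{E}[t|\eta]$. We now see that there is a considerable simplification if we choose a
particular form for the link function $f^{-1}(y)$ given by

$$f^{-1}(y) = \psi(y), \tag{5.94}$$

which gives $f(\psi(y)) = y$ and hence $f'(\psi)\psi'(y) = 1$. Also, because $a = f^{-1}(y)$,
we have $a = \psi$ and hence $f'(a)\psi'(y) = 1$. In this case, the gradient of the error
function reduces to

$$\nabla E(\mathbf{w}) = \frac{1}{s} \sum_{n=1}^{N} \{ y_n - t_n \} \boldsymbol{\phi}_n. \tag{5.95}$$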
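
As a brief worked illustration of this construction, consider a Bernoulli-distributed binary target $t \in \{0, 1\}$ with mean $y$. The choices $s = 1$ and $h(t) = 1$ below are assumptions made to match the restricted form (5.89); the steps are a sketch rather than part of the derivation above.

```latex
\begin{align*}
p(t \mid y) &= y^{t} (1 - y)^{1 - t}
             = (1 - y) \exp\left\{ t \ln \frac{y}{1 - y} \right\}, \\
\eta &= \psi(y) = \ln \frac{y}{1 - y},
  \qquad g(\eta) = 1 - y = \frac{1}{1 + e^{\eta}}, \\
-\frac{\mathrm{d}}{\mathrm{d}\eta} \ln g(\eta)
  &= \frac{\mathrm{d}}{\mathrm{d}\eta} \ln\left( 1 + e^{\eta} \right)
   = \frac{e^{\eta}}{1 + e^{\eta}} = \sigma(\eta) = y, \\
f^{-1}(y) &= \psi(y) = \ln \frac{y}{1 - y}
  \quad \Longrightarrow \quad f(a) = \sigma(a).
\end{align*}
```

The third line agrees with (5.90) for $s = 1$, and the last line shows that the canonical link is the logit function, so the corresponding activation function is the logistic sigmoid. Substituting into the reduced gradient with $s = 1$ gives $\sum_n (y_n - t_n) \boldsymbol{\phi}_n$, which matches the logistic regression gradient (5.75).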


We have seen that there is a natural pairing between the choice of error function
and the choice of output-unit activation function. Although we have derived this
result in the context of single-layer network models, the same considerations apply
to deep neural networks discussed in later chapters.


Exercises


5.1 (⋆) Consider a classification problem with $K$ classes and a target vector $\mathbf{t}$ that uses a
1-of-$K$ binary coding scheme. Show that the conditional expectation $\mathbb{E}[\mathbf{t}|\mathbf{x}]$ is given
by the posterior probability $p(\mathcal{C}_k|\mathbf{x})$.


5.2 (⋆⋆) Given a set of data points $\{\mathbf{x}_n\}$, we can define the convex hull to be the set of
all points $\mathbf{x}$ given by

$$\mathbf{x} = \sum_n \alpha_n \mathbf{x}_n \tag{5.96}$$

where $\alpha_n \geqslant 0$ and $\sum_n \alpha_n = 1$. Consider a second set of points $\{\mathbf{y}_n\}$ together with
their corresponding convex hull. By definition, the two sets of points will be linearly
separable if there exists a vector $\hat{\mathbf{w}}$ and a scalar $w_0$ such that $\hat{\mathbf{w}}^{\mathrm{T}} \mathbf{x}_n + w_0 > 0$ for all
$\mathbf{x}_n$ and $\hat{\mathbf{w}}^{\mathrm{T}} \mathbf{y}_n + w_0 < 0$ for all $\mathbf{y}_n$. Show that if their convex hulls intersect, the two
sets of points cannot be linearly separable, and conversely that if they are linearly
separable, their convex hulls do not intersect.


5.3 (⋆⋆) Consider the minimization of a sum-of-squares error function (5.14), and suppose
that all the target vectors in the training set satisfy a linear constraint

$$\mathbf{a}^{\mathrm{T}} \mathbf{t}_n + b = 0 \tag{5.97}$$

where $\mathbf{t}_n$ corresponds to the $n$th row of the matrix $\mathbf{T}$ in (5.14). Show that as a
consequence of this constraint, the elements of the model prediction $\mathbf{y}(\mathbf{x})$ given by
the least-squares solution (5.16) also satisfy this constraint, so that

$$\mathbf{a}^{\mathrm{T}} \mathbf{y}(\mathbf{x}) + b = 0. \tag{5.98}$$

To do so, assume that one of the basis functions $\phi_0(\mathbf{x}) = 1$ so that the corresponding
parameter $w_0$ plays the role of a bias.


5.4 (⋆⋆) Extend the result of Exercise 5.3 to show that if multiple linear constraints are
satisfied simultaneously by the target vectors, then the same constraints will also be
satisfied by the least-squares prediction of a linear model.


5.5 (⋆) Use the definition (5.38), along with (5.30) and (5.31) to derive the result (5.39)
for the F-score.


5.6 (⋆⋆) Consider two non-negative numbers $a$ and $b$, and show that, if $a \leqslant b$, then
$a \leqslant (ab)^{1/2}$. Use this result to show that, if the decision regions of a two-class
classification problem are chosen to minimize the probability of misclassification,
this probability will satisfy

$$p(\text{mistake}) \leqslant \int \left\{ p(\mathbf{x}, \mathcal{C}_1) \, p(\mathbf{x}, \mathcal{C}_2) \right\}^{1/2} \mathrm{d}\mathbf{x}. \tag{5.99}$$


5.7 (⋆) Given a loss matrix with elements $L_{kj}$, the expected risk is minimized if, for
each $\mathbf{x}$, we choose the class that minimizes (5.23). Verify that, when the loss matrix
is given by $L_{kj} = 1 - I_{kj}$, where $I_{kj}$ are the elements of the identity matrix, this
reduces to the criterion of choosing the class having the largest posterior probability.
What is the interpretation of this form of loss matrix?


5.8 (⋆) Derive the criterion for minimizing the expected loss when there is a general loss
matrix and general prior probabilities for the classes.


5.9 (⋆) Consider the average of the posterior probabilities over a set of $N$ data points in
the form

$$\frac{1}{N} \sum_{n=1}^{N} p(\mathcal{C}_k | \mathbf{x}_n). \tag{5.100}$$

By taking the limit $N \to \infty$, show that this quantity approaches the prior class
probability $p(\mathcal{C}_k)$.


5.10 (⋆⋆) Consider a classification problem in which the loss incurred when an input
vector from class $\mathcal{C}_k$ is classified as belonging to class $\mathcal{C}_j$ is given by the loss matrix
$L_{kj}$ and for which the loss incurred in selecting the reject option is $\lambda$. Find the
decision criterion that will give the minimum expected loss. Verify that this reduces
to the reject criterion discussed in Section 5.2.3 when the loss matrix is given by
$L_{kj} = 1 - I_{kj}$. What is the relationship between $\lambda$ and the rejection threshold $\theta$?


5.11 (⋆) Show that the logistic sigmoid function (5.42) satisfies the property $\sigma(-a) =
1 - \sigma(a)$ and that its inverse is given by $\sigma^{-1}(y) = \ln \{ y / (1 - y) \}$.


5.12 (⋆) Using (5.40) and (5.41), derive the result (5.48) for the posterior class probability
in the two-class generative model with Gaussian densities, and verify the results
(5.49) and (5.50) for the parameters $\mathbf{w}$ and $w_0$.


5.13 (⋆) Consider a generative classification model for $K$ classes defined by prior class
probabilities $p(\mathcal{C}_k) = \pi_k$ and general class-conditional densities $p(\boldsymbol{\phi}|\mathcal{C}_k)$ where $\boldsymbol{\phi}$
is the input feature vector. Suppose we are given a training data set $\{\boldsymbol{\phi}_n, \mathbf{t}_n\}$ where
$n = 1, \ldots, N$, and $\mathbf{t}_n$ is a binary target vector of length $K$ that uses the 1-of-$K$
coding scheme, so that it has components $t_{nj} = I_{jk}$ if data point $n$ is from class $\mathcal{C}_k$.
Assuming that the data points are drawn independently from this model, show that
the maximum-likelihood solution for the prior probabilities is given by

$$\pi_k = \frac{N_k}{N} \tag{5.101}$$

where $N_k$ is the number of data points assigned to class $\mathcal{C}_k$.


5.14 (⋆⋆) Consider the classification model of Exercise 5.13 and now suppose that the
class-conditional densities are given by Gaussian distributions with a shared covariance
matrix, so that

$$p(\boldsymbol{\phi} | \mathcal{C}_k) = \mathcal{N}(\boldsymbol{\phi} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}). \tag{5.102}$$

Show that the maximum likelihood solution for the mean of the Gaussian distribution
for class $\mathcal{C}_k$ is given by

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} \boldsymbol{\phi}_n, \tag{5.103}$$

which represents the mean of those feature vectors assigned to class $\mathcal{C}_k$. Similarly,
show that the maximum likelihood solution for the shared covariance matrix is given
by

$$\boldsymbol{\Sigma} = \sum_{k=1}^{K} \frac{N_k}{N} \mathbf{S}_k \tag{5.104}$$

where

$$\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} (\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}. \tag{5.105}$$

Thus, $\boldsymbol{\Sigma}$ is given by a weighted average of the covariances of the data associated with
each class, in which the weighting coefficients are given by the prior probabilities of
the classes.


5.15 (⋆⋆) Derive the maximum likelihood solution for the parameters $\{\mu_{ki}\}$ of the probabilistic
naive Bayes classifier with discrete binary features described in Section 5.3.3.


5.16 (⋆⋆) Consider a classification problem with $K$ classes for which the feature vector
$\boldsymbol{\phi}$ has $M$ components each of which can take $L$ discrete states. Let the values of the
components be represented by a 1-of-$L$ binary coding scheme. Further suppose that,
conditioned on the class $\mathcal{C}_k$, the $M$ components of $\boldsymbol{\phi}$ are independent, so that the
class-conditional density factorizes with respect to the feature vector components.
Show that the quantities $a_k$ given by (5.46), which appear in the argument to the
softmax function describing the posterior class probabilities, are linear functions of
the components of $\boldsymbol{\phi}$. Note that this represents an example of a naive Bayes model.


5.17 (⋆⋆) Derive the maximum likelihood solution for the parameters of the probabilistic
naive Bayes classifier described in Exercise 5.16.


5.18 (⋆) Verify the relation (5.72) for the derivative of the logistic sigmoid function defined
by (5.42).


5.19 (⋆) By making use of the result (5.72) for the derivative of the logistic sigmoid, show
that the derivative of the error function (5.74) for the logistic regression model is
given by (5.75).


5.20 (⋆) Show that for a linearly separable data set, the maximum likelihood solution
for the logistic regression model is obtained by finding a vector $\mathbf{w}$ whose decision
boundary $\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) = 0$ separates the classes and then taking the magnitude of $\mathbf{w}$ to
infinity.


5.21 (⋆) Show that the derivatives of the softmax activation function (5.76), where the $a_k$
are defined by (5.77), are given by (5.78).


5.22 (⋆) Using the result (5.78) for the derivatives of the softmax activation function, show
that the gradients of the cross-entropy error (5.80) are given by (5.81).


5.23 (⋆) Show that the probit function (5.86) and the erf function (5.87) are related by
(5.88).


5.24 (⋆⋆) Suppose we wish to approximate the logistic sigmoid $\sigma(a)$ defined by (5.42)
by a scaled probit function $\Phi(\lambda a)$, where $\Phi(a)$ is defined by (5.86). Show that if $\lambda$ is
chosen so that the derivatives of the two functions are equal at $a = 0$, then $\lambda^2 = \pi/8$.


6 Deep Neural Networks

In recent years, neural networks have emerged as, by far, the most important ma- 
chine learning technology for practical applications, and we therefore devote a large 
fraction of this book to studying them. Previous chapters have already laid many 
of the foundations we will need. In particular, we have seen that linear regression 
models that comprise linear combinations of fixed nonlinear basis functions can be 
expressed as neural networks having a single layer of weight and bias parameters. 
Likewise, classification models based on linear combinations of basis functions can 
also be viewed as single-layer neural networks. These allowed us to introduce several 
important concepts before we embark on a discussion of more complex multilayered 
networks in this chapter. 

Given a sufficient number of suitably chosen basis functions, such linear models 
can approximate any given nonlinear transformation from inputs to outputs to any 
desired accuracy and might therefore appear to be sufficient to tackle any practical
application. However, these models have some severe limitations, and so we will
begin our discussion of neural networks by exploring these limitations and understanding
why it is necessary to use basis functions that are themselves learned from
data. This leads naturally to a discussion of neural networks having more than one
layer of learnable parameters. These are known as feed-forward networks or multilayer
perceptrons. We will also discuss the benefits of having many such layers of
processing, leading to the key concept of deep neural networks that now dominate
the field of machine learning.


6.1. Limitations of Fixed Basis Functions


Linear basis function models for classification are based on linear combinations of
basis functions $\phi_j(\mathbf{x})$ and take the form

$$
y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) + w_0 \right) \tag{6.1}
$$


where $f(\cdot)$ is a nonlinear output activation function. Linear models for regression
take the same form but with $f(\cdot)$ replaced by the identity. These models allow for
an arbitrary set of nonlinear basis functions $\{\phi_j(\mathbf{x})\}$, and because of the generality
of these basis functions, such models can in principle provide a solution to any re- 
gression or classification problem. This is true in a trivial sense in that if one of the 
basis functions corresponds to the desired input-to-output transformation, then the 
learnable linear layer simply has to copy the value of this basis function to the output 
of the model. 

More generally, we would expect that a sufficiently large and rich set of basis 
functions would allow any desired function to be approximated to arbitrary accu- 
racy. It would seem therefore that such linear models constitute a general purpose 
framework for solving problems in machine learning. Unfortunately, there are some 
significant shortcomings with linear models, which arise from the assumption that 
the basis functions $\phi_j(\mathbf{x})$ are fixed and independent of the training data. To understand
these limitations, we start by looking at the behaviour of linear models as the
number of input variables is increased. 


6.1.1 The curse of dimensionality 


Consider a simple regression model for a single input variable given by a polynomial
of order $M$ in the form

$$
y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M \tag{6.2}
$$


and let us see what happens if we increase the number of inputs. If we have $D$ input
variables $\{x_1, \ldots, x_D\}$, then a general polynomial with coefficients up to order 3
would take the form

$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{ijk} x_i x_j x_k. \tag{6.3}
$$

Figure 6.1 Plot of the Iris data in which red, green, and blue points denote
three species of iris flower and the axes represent measurements of the length
and width of the sepal, respectively. Our goal is to classify a new test point
such as the one denoted by ×. [Axes: sepal length and sepal width.]


As $D$ increases, the growth in the number of independent coefficients is $O(D^3)$,
whereas for a polynomial of order $M$, the growth in the number of coefficients is
$O(D^M)$ (Bishop, 2006). We see that in spaces of higher dimensionality, polynomials
can rapidly become unwieldy and of little practical utility. 
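
To make this growth concrete, the following minimal sketch counts the independent coefficients of a polynomial of order $M$ in $D$ variables. It assumes that symmetric duplicates such as $w_{ij}$ and $w_{ji}$ are counted once, so the count is the number of monomials of total degree at most $M$, which is the binomial coefficient $\binom{D+M}{M}$. The function name is ours and is used for illustration only.

```python
from math import comb


def n_poly_coefficients(D: int, M: int) -> int:
    """Independent coefficients of a polynomial of order M in D variables.

    Counts the monomials of total degree 0, 1, ..., M, treating coefficients
    that multiply the same monomial (e.g. w_ij and w_ji) as a single one.
    """
    return comb(D + M, M)


# Growth with the input dimensionality for a cubic (M = 3), as in (6.3).
for D in (1, 10, 100, 1000):
    print(D, n_poly_coefficients(D, 3))
```

For fixed $M$ the count grows as $D^M / M!$ for large $D$, in agreement with the $O(D^M)$ scaling quoted above.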

The severe difficulties that can arise in spaces of many dimensions are sometimes
called the curse of dimensionality (Bellman, 1961). It is not limited to polynomial 
regression but is in fact quite general. Consider the use of linear models for solv- 
ing classification problems. Figure 6.1 shows a plot of data from the Iris data set 
comprising 50 observations taken from each of three species of iris flowers. Each 
observation has four variables representing measurements of the sepal length, sepal 
width, petal length, and petal width. For this illustration, we consider only the sepal 
length and sepal width variables. Given these 150 observations as training data, our 
goal is to classify a new test point, such as the one denoted by the cross in Figure 6.1, 
by assigning it to one of the three species. We observe that the cross is close to sev- 
eral red points, and so we might suppose that it belongs to the red class. However, 
there are also some green points nearby, so we might think that it could instead be- 
long to the green class. It seems less likely that it belongs to the blue class. The 
intuition here is that the identity of the cross should be determined more strongly by 
nearby points from the training set and less strongly by more distant points, and this 
intuition turns out to be reasonable. 

One very simple way of converting this intuition into a learning algorithm would
be to divide the input space into regular cells, as indicated in Figure 6.2. When we
are given a test point and we wish to predict its class, we first decide which cell it
belongs to, and then we find all the training data points that fall in the same cell. The
identity of the test point is predicted to be the same as the class having the largest
number of training points in the same cell as the test point (with ties being broken
at random). We can view this as a basis function model in which there is a basis
function $\phi_j(\mathbf{x})$ for each grid cell, which simply returns zero if $\mathbf{x}$ lies outside the
grid cell, and otherwise returns the majority class of the training data points that fall
inside the cell. The output of the model is then given by the sum of the outputs of all
the basis functions.

Figure 6.2 Illustration of a simple approach for solving classification problems
in which the input space is divided into cells and any new test point is assigned
to the class that has the most representatives in the same cell as the test point.
As we shall see shortly, this simplistic approach has some severe shortcomings.
[Axes: sepal length and sepal width.]
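
The cell-based classifier can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names and the `lo`, `hi`, and `n_bins` arguments (the bounds of the gridded region and the number of cells per axis) are assumptions introduced here, and ties are broken by first occurrence instead of at random.

```python
from collections import Counter, defaultdict

import numpy as np


def cell_index(X, lo, hi, n_bins):
    """Integer grid coordinates of each row of X on a regular grid over [lo, hi]."""
    X = np.asarray(X, dtype=float)
    idx = np.floor((X - lo) / (hi - lo) * n_bins).astype(int)
    return np.clip(idx, 0, n_bins - 1)  # points on the upper edge go to the last cell


def fit_grid_classifier(X, labels, lo, hi, n_bins):
    """Map every occupied cell to the majority class of its training points."""
    votes = defaultdict(Counter)
    for cell, label in zip(map(tuple, cell_index(X, lo, hi, n_bins)), labels):
        votes[cell][label] += 1
    # most_common breaks ties by first occurrence (the text breaks them at random).
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}


def predict(model, x, lo, hi, n_bins):
    """Return the majority class of the cell containing x, or None if it is empty."""
    cell = tuple(cell_index(np.atleast_2d(x), lo, hi, n_bins)[0])
    return model.get(cell)


# Tiny synthetic example on the unit square with a 2 x 2 grid.
X = np.array([[0.1, 0.1], [0.2, 0.15], [0.8, 0.9]])
labels = ["red", "red", "blue"]
lo, hi = np.zeros(2), np.ones(2)
model = fit_grid_classifier(X, labels, lo, hi, n_bins=2)
```

A query that falls in a cell with no training points returns `None`, which is exactly the failure mode discussed next.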

There are numerous problems with this naive approach, but one of the most 
severe becomes apparent when we consider its extension to problems having larger 
numbers of input variables, corresponding to input spaces of higher dimensionality. 
The origin of the problem is illustrated in Figure 6.3, which shows that, if we divide a 
region of a space into regular cells, then the number of such cells grows exponentially 
with the dimensionality of the space. The challenge with an exponentially large 
number of cells is that we would need an exponentially large quantity of training
data to ensure that the cells are not empty. We have already seen in Figure 6.2 that
some cells contain no training points. Hence, a test point in such cells cannot be
classified. Clearly, we have no hope of applying such a technique in a space of more
than a few variables. The difficulties with both the polynomial regression example
and the Iris data classification example arise because the basis functions were chosen
independently of the problem being solved. We will need to be more sophisticated
in our choice of basis functions if we are to circumvent the curse of dimensionality.

Figure 6.3 Illustration of the curse of dimensionality, showing how the
number of regions of a regular grid grows exponentially with the dimensionality
$D$ of the space. For clarity, only a subset of the cubical regions
are shown for $D = 3$.

Figure 6.4 Plot of the fraction of the volume of a hypersphere of radius $r = 1$
lying in the range $r = 1 - \epsilon$ to $r = 1$ for various values of the
dimensionality $D$. [Vertical axis: volume fraction.]
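
The shortage of data relative to the number of cells is easy to see numerically. The sketch below assumes 150 points drawn uniformly from the unit hypercube (a stand-in for real data, not the Iris measurements) and 10 cells per axis, and reports the fraction of cells that contain no point at all.

```python
import numpy as np


def empty_cell_fraction(n_points, n_bins, D, seed=0):
    """Fraction of the n_bins**D grid cells left empty by n_points uniform samples."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, D))
    idx = np.minimum((X * n_bins).astype(int), n_bins - 1)
    occupied = len({tuple(row) for row in idx})  # number of distinct occupied cells
    return 1.0 - occupied / n_bins ** D


for D in range(1, 7):
    print(D, 10 ** D, empty_cell_fraction(150, 10, D))
```

Because at most 150 cells can be occupied, the empty fraction is at least $1 - 150/10^D$, which already exceeds 98% for $D = 4$.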


6.1.2 High-dimensional spaces 


First, however, we will look more closely at the properties of spaces with higher
dimensionality where our geometrical intuitions, formed through a life spent in a
space of three dimensions, can fail badly. As a simple example, consider a hypersphere
of radius $r = 1$ in a space of $D$ dimensions, and ask what is the fraction of
the volume of the hypersphere that lies between radius $r = 1 - \epsilon$ and $r = 1$. We can
evaluate this fraction by noting that the volume $V_D(r)$ of a hypersphere of radius $r$
in $D$ dimensions must scale as $r^D$, and so we write

$$
V_D(r) = K_D r^D \tag{6.4}
$$

where the constant $K_D$ depends only on $D$. Thus, the required fraction is given by

$$
\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D, \tag{6.5}
$$

which is plotted as a function of $\epsilon$ for various values of $D$ in Figure 6.4. We see that,
for large $D$, this fraction tends to 1 even for small values of $\epsilon$. Thus, we arrive at
the remarkable result that, in spaces of high dimensionality, most of the volume of a
hypersphere is concentrated in a thin shell near the surface!

Figure 6.5 Plot of the probability density with respect to radius $r$ of a Gaussian
distribution for various values of the dimensionality $D$. In a high-dimensional
space, most of the probability mass of a Gaussian is located within a thin shell
at a specific radius.
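
The shell fraction $1 - (1 - \epsilon)^D$ from (6.5) can be evaluated directly. The short sketch below (the function name is ours) tabulates it for a shell of thickness $\epsilon = 0.05$ at several dimensionalities.

```python
def shell_fraction(eps: float, D: int) -> float:
    """Fraction of a unit hypersphere's volume lying between r = 1 - eps and r = 1."""
    return 1.0 - (1.0 - eps) ** D


for D in (1, 2, 5, 20, 100):
    print(D, round(shell_fraction(0.05, D), 4))
```

For $D = 100$ the outer 5% of the radius already holds more than 99% of the volume.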


As a further example of direct relevance to machine learning, consider the be- 
haviour of a Gaussian distribution in a high-dimensional space. If we transform from 
Cartesian to polar coordinates and then integrate out the directional variables, we obtain
an expression for the density $p(r)$ as a function of radius $r$ from the origin. Thus,
$p(r)\delta r$ is the probability mass inside a thin shell of thickness $\delta r$ located at radius $r$.
This distribution is plotted, for various values of D, in Figure 6.5, and we see that 
for large D, the probability mass of the Gaussian is concentrated in a thin shell at a 
specific radius. 
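
This concentration can be checked by sampling. The sketch below assumes a zero-mean, unit-variance isotropic Gaussian (the text does not fix the variance) and measures the mean and standard deviation of the sample radii; for this choice the radii cluster around $\sqrt{D}$ while their spread stays of order one.

```python
import numpy as np


def radius_stats(D, n_samples=2000, seed=0):
    """Mean and standard deviation of ||x|| for x drawn from N(0, I) in D dimensions."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(rng.standard_normal((n_samples, D)), axis=1)
    return r.mean(), r.std()


for D in (1, 10, 100, 1000):
    mean, std = radius_stats(D)
    print(D, round(mean, 3), round(std, 3), round(np.sqrt(D), 3))
```

Relative to the typical radius, the shell therefore becomes thinner and thinner as $D$ grows.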

In this book, we make extensive use of illustrative examples involving one or two 
variables, because this makes it particularly easy to visualize these spaces graphi- 
cally. The reader should be warned, however, that not all intuitions developed in 
spaces of low dimensionality will generalize to situations involving many dimen- 
sions. 

Finally, although we have talked about the curse of dimensionality, there can 
also be advantages to working in high-dimensional spaces. Consider the situation 
shown in Figure 6.6. We see that this data set, in which each data point consists
of a pair of values $(x_1, x_2)$, is linearly separable, but when only the value of $x_1$ is
observed, the classes have a strong overlap. The classification problem is therefore
much easier in the higher-dimensional space. 


6.1.3 Data manifolds 


With both the polynomial regression model and the grid-based classifier in Fig- 
ure 6.2, we saw that the number of basis functions grows rapidly with dimensionality, 
making such methods impractical for applications involving even a few dozen vari- 
ables, never mind the millions of inputs that often arise with, say, image processing. 
The problem is that the basis functions are fixed ahead of time and do not depend on 
the data, or indeed even on the specific problem being solved. We need to find a way 
to create basis functions that are tuned to the particular application. 

Although the curse of dimensionality certainly raises important issues for machine
learning applications, it does not prevent us from finding effective techniques
applicable to high-dimensional spaces. One reason for this is that real data will
generally be confined to a region of the data space having lower effective dimensionality.
Consider the images shown in Figure 6.7. Each image is a point in a
high-dimensional space whose dimensionality is determined by the number of pixels.
Because the objects can occur at different vertical and horizontal positions within
the image and in different orientations, there are three degrees of freedom of variability
between images, and a set of images will, to a first approximation, live on
a three-dimensional manifold embedded within the high-dimensional space. Due to
the complex relationships between the object position or orientation and the pixel
intensities, this manifold will be highly nonlinear.

Figure 6.6 Illustration of a data set in two dimensions $(x_1, x_2)$ in which data points from the two classes depicted
using green and red circles can be separated by a linear decision surface, as seen in (a). If, however, only
the variable $x_1$ is measured then the classes are no longer separable, as seen in (b).

Figure 6.7 Examples of images of a handwritten digit that differ in the location of the digit within
the images as well as in their orientation. This data lives on a nonlinear three-dimensional
manifold within the high-dimensional image space.

In fact, the number of pixels is really an artefact of the image generation process
since they represent measurements of a continuous world. Capturing the same
image at a higher resolution increases the dimensionality $D$ of the data space without
changing the fact that the images still live on a three-dimensional manifold. If
we can associate localized basis functions with the data manifold, rather than with
the entire high-dimensional data space, we might expect that the number of required
basis functions would grow exponentially with the dimensionality of the manifold
rather than with the dimensionality of the data space. Since the manifold will typically
have a much lower dimensionality than the data space, this represents a huge
improvement. Effectively, neural networks learn a set of basis functions that are
adapted to data manifolds. Moreover, for a particular application, not all directions
within the manifold may be significant. For example, if we wish to determine only
the orientation, and not the position, of the object in Figure 6.7, then there is only one
relevant degree of freedom on the manifold and not three. Neural networks are also
able to learn which directions on the manifold are relevant to predicting the desired
outputs.

Figure 6.8 The top row shows examples of natural images of size 64 × 64 pixels, whereas the bottom
row shows randomly generated images of the same size obtained by drawing pixel values
from a uniform probability distribution over the possible pixel colours.

Another way to see that real data is confined to low-dimensional manifolds is 
to consider the task of generating random images. In Figure 6.8 we see examples 
of natural images along with examples of synthetic images of the same resolution 
generated by sampling each of the red, green, and blue intensities at each pixel inde- 
pendently at random from a uniform distribution. We see that none of the synthetic 
images look at all like natural images. The reason is that these random images lack 
the very strong correlations between pixels that natural images exhibit. For example, 
two adjacent pixels in a natural image have a much higher probability of having the
same, or very similar, colour, than would two adjacent pixels in the random examples.
Each of the images in Figure 6.8 corresponds to a point in a high-dimensional
space, yet natural images cover only a tiny fraction of this space. 


6.1.4 Data-dependent basis functions 


We have seen that simple basis functions that are chosen independently of the 
problem being solved can run into significant limitations, particularly in spaces of 
high dimensionality. If we want to use basis functions in such situations, then one 
approach would be to use expert knowledge to hand-craft the basis functions in a
way that is specific to each application. For many years, this was the mainstream
approach in machine learning. Basis functions, often called features, would be de- 
termined by a combination of domain knowledge and trial-and-error. However, this 
approach met with limited success and was superseded by data-driven approaches 
in which basis functions are learned from the training data. Domain knowledge still 
plays a role in modern machine learning, but at a more qualitative level in designing 
network architectures where it can capture appropriate inductive bias, as we will see 
in later chapters. 

Since data in a high-dimensional space may be confined to a low-dimensional 
manifold, we do not need basis functions that densely fill the whole input space, 
but instead we can use basis functions that are themselves associated with the data 
manifold. One way to do this is to have one basis function associated with each data 
point in the training set, which ensures that the basis functions are automatically 
adapted to the underlying data manifold. An example of such a model is that of 
radial basis functions (Broomhead and Lowe, 1988), which have the property that 
each basis function depends only on the radial distance (typically Euclidean) from a 
central vector. If the basis centres are chosen to be the input data values $\{\mathbf{x}_n\}$ then
there is one basis function $\phi_n(\mathbf{x})$ for each data point, which will therefore capture
the whole of the data manifold. A typical choice for a radial basis function is

$$
\phi_n(\mathbf{x}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{s^2} \right) \tag{6.6}
$$


where s is a parameter controlling the width of the basis function. Although it can be 
quick to set up such a model, a major problem with this technique is that it becomes 
computationally unwieldy for large data sets. Moreover, the model needs careful 
regularization to avoid severe over-fitting. 
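
A radial basis function model with one centre per training point can be sketched as follows. The sketch assumes the Gaussian form $\exp(-\|\mathbf{x} - \mathbf{x}_n\|^2 / s^2)$ for the basis functions and uses a simple quadratic (ridge) penalty with coefficient `lam` as the regularizer; the text does not specify a particular regularizer, and all function names here are ours.

```python
import numpy as np


def rbf_design_matrix(X, centres, s):
    """Matrix with entries Phi[n, m] = exp(-||x_n - c_m||^2 / s^2)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / s ** 2)


def fit_rbf(X, t, s, lam):
    """Regularized least-squares weights, one basis function per training point."""
    Phi = rbf_design_matrix(X, X, s)
    A = Phi.T @ Phi + lam * np.eye(len(X))  # N x N system: cost grows quickly with N
    return np.linalg.solve(A, Phi.T @ t)


def predict_rbf(X_new, X_train, w, s):
    """Evaluate the fitted model at new inputs."""
    return rbf_design_matrix(X_new, X_train, s) @ w


# Small synthetic example (illustrative data only).
X = np.linspace(0.0, 1.0, 20)[:, None]
t = np.sin(2.0 * np.pi * X[:, 0])
w = fit_rbf(X, t, s=0.2, lam=1e-3)
Phi = rbf_design_matrix(X, X, 0.2)
```

The design matrix has $N \times N$ entries and the linear solve scales roughly as $N^3$, which is the computational difficulty for large data sets noted above.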

A related approach, called a support vector machine or SVM (Vapnik, 1995; 
Schölkopf and Smola, 2002; Bishop, 2006), addresses this by again defining basis 
functions that are centred on each of the training data points and then selecting a 
subset of these automatically during training. As a result, the effective number of 
basis functions in the resulting models is generally much smaller than the number of 
training points, although it is often still relatively large and typically increases with 
the size of the training set. Support vector machines also do not produce probabilistic 
outputs, and they do not naturally generalize to more than two classes. Methods such 
as radial basis functions and support vector machines have been superseded by deep 
neural networks, which are much better at exploiting very large data sets efficiently. 
Moreover, as we will see later, neural networks are able to learn deep hierarchical 
representations, which are crucial to achieving high prediction accuracy in more 
complex applications.


6.2. Multilayer Networks


In the previous section, we saw that to apply linear models of the form (6.1) to prob- 
lems involving large-scale data sets and high-dimensional spaces, we need to find a 
set of basis functions that is tuned to the problem being solved. The key idea behind 
neural networks is to choose basis functions $\phi_j(\mathbf{x})$ that themselves have learnable
parameters and then allow these parameters to be adjusted, along with the coefficients
$\{w_j\}$, during training. We then optimize the whole model by minimizing an
error function using gradient-based optimization methods, such as stochastic gradi- 
ent descent, where the error function is defined jointly across all the parameters in 
the model. 

There are, of course, many ways to construct parametric nonlinear basis func- 
tions. One key requirement is that they must be differentiable functions of their 
learnable parameters so that we can apply gradient-based optimization. The most 
successful choice has been to use basis functions that follow the same form as (6.1), 
so that each basis function is itself a nonlinear function of a linear combination of 
the inputs, where the coefficients in the linear combination are learnable parameters. 
Note that this construction can clearly be extended recursively to give a hierarchical 
model with many layers, which forms the basis for deep neural networks. 

Consider a basic neural network model having two layers of learnable parameters.
First, we construct $M$ linear combinations of the input variables $x_1, \ldots, x_D$ in
the form

$$
a_j^{(1)} = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \tag{6.7}
$$

where $j = 1, \ldots, M$, and the superscript (1) indicates that the corresponding parameters
are in the first ‘layer’ of the network. We will refer to the parameters $w_{ji}^{(1)}$
as weights and the parameters $w_{j0}^{(1)}$ as biases, while the quantities $a_j^{(1)}$ are called
pre-activations. Each of the quantities $a_j^{(1)}$ is then transformed using a differentiable,
nonlinear activation function $h(\cdot)$ to give

$$
z_j^{(1)} = h(a_j^{(1)}), \tag{6.8}
$$

which represent the outputs of the basis functions in (6.1). In the context of neural
networks, these basis functions are called hidden units. We will explore various
choices for the nonlinear function $h(\cdot)$ shortly, but here we note that provided the
derivative $h'(\cdot)$ can be evaluated, then the overall network function will be differentiable.
Following (6.1), these values are again linearly combined to give

$$
a_k^{(2)} = \sum_{j=1}^{M} w_{kj}^{(2)} z_j^{(1)} + w_{k0}^{(2)} \tag{6.9}
$$

where $k = 1, \ldots, K$, and $K$ is the total number of outputs. This transformation
corresponds to the second layer of the network, and again the $w_{k0}^{(2)}$ are bias parameters.
Finally, the $\{a_k^{(2)}\}$ are transformed using an appropriate output-unit activation
function $f(\cdot)$ to give a set of network outputs $y_k$. A two-layer neural network can be
represented in diagram form as shown in Figure 6.9.

Figure 6.9 Network diagram for a two-layer neural network. The input, hidden,
and output variables are represented by nodes, and the weight parameters are
represented by links between the nodes. The bias parameters are denoted by links
coming from additional input and hidden variables $x_0$ and $z_0$ which are themselves
denoted by solid nodes. Arrows denote the direction of information flow
through the network during forward propagation.
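
The forward computation defined by (6.7) to (6.9) can be written compactly with NumPy. In this minimal sketch the choice of tanh for $h$ and the identity for $f$ are defaults chosen for illustration, and the array names `W1`, `b1`, `W2`, `b2` are ours: `W1` has shape $(M, D)$ and `W2` has shape $(K, M)$.

```python
import numpy as np


def forward(x, W1, b1, W2, b2, h=np.tanh, f=lambda a: a):
    """Forward pass of a two-layer network for a single input vector x."""
    a1 = W1 @ x + b1   # first-layer pre-activations, as in (6.7)
    z1 = h(a1)         # hidden-unit outputs, as in (6.8)
    a2 = W2 @ z1 + b2  # second-layer pre-activations, as in (6.9)
    return f(a2)       # network outputs y_k


# Illustrative sizes: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = rng.standard_normal(D)
y = forward(x, W1, b1, W2, b2)
```

Each matrix-vector product evaluates all of the sums over $i$ or $j$ at once, so the loops implied by the summation signs never appear explicitly.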


6.2.1 Parameter matrices 


As we discussed in the context of linear regression models, the bias parameters
in (6.7) can be absorbed into the set of weight parameters by defining an additional
input variable $x_0$ whose value is clamped at $x_0 = 1$, so that (6.7) takes the form

$$
a_j^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i. \tag{6.10}
$$

We can similarly absorb the second-layer biases into the second-layer weights, so
that the overall network function becomes

$$
y_k(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=0}^{M} w_{kj}^{(2)} h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right). \tag{6.11}
$$

Another notation that will prove convenient at various points in the book is to represent
the inputs as a column vector $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$ and then to gather the weight
and bias parameters in (6.11) into matrices to give

$$
\mathbf{y}(\mathbf{x}, \mathbf{w}) = f\left( \mathbf{W}^{(2)} h\left( \mathbf{W}^{(1)} \mathbf{x} \right) \right) \tag{6.12}
$$

where $f(\cdot)$ and $h(\cdot)$ are evaluated on each vector element separately.
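
The following sketch checks numerically that absorbing the biases leaves the network function unchanged. It assumes tanh hidden units and a linear output, and the helper names are ours. Note that the clamped variables $x_0 = 1$ and $z_0 = 1$ must be prepended explicitly before each matrix product.

```python
import numpy as np


def absorb_bias(W, b):
    """Augmented matrix [b, W], so that [b, W] @ [1, x] equals W @ x + b."""
    return np.hstack([b[:, None], W])


def forward_augmented(x, W1_aug, W2_aug, h=np.tanh, f=lambda a: a):
    """Two-layer forward pass with the biases stored in column 0 of each matrix."""
    x_aug = np.concatenate(([1.0], x))  # clamped input x_0 = 1
    z = h(W1_aug @ x_aug)
    z_aug = np.concatenate(([1.0], z))  # clamped hidden variable z_0 = 1
    return f(W2_aug @ z_aug)


rng = np.random.default_rng(1)
D, M, K = 3, 4, 2
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = rng.standard_normal(D)

y_explicit = W2 @ np.tanh(W1 @ x + b1) + b2
y_augmented = forward_augmented(x, absorb_bias(W1, b1), absorb_bias(W2, b2))
```

The two outputs agree to floating-point precision, so the choice between explicit and absorbed biases is purely one of notational convenience.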


6.2.2 Universal approximation 


The capability of a two-layer network to model a broad range of functions is 
illustrated in Figure 6.10. This figure also shows how individual hidden units work 
collaboratively to approximate the final function. The role of hidden units in a simple 
classification problem is illustrated in Figure 6.11.

Figure 6.10 Illustration of the capability of a two-layer neural network
to approximate four different functions: (a) $f(x) = x^2$, (b) $f(x) = \sin(x)$,
(c) $f(x) = |x|$, and (d) $f(x) = H(x)$ where $H(x)$ is the Heaviside step function.
In each case, $N = 50$ data points, shown as blue dots, have been sampled uniformly
in $x$ over the interval $(-1, 1)$ and the corresponding values of $f(x)$ evaluated.
These data points are then used to train a two-layer network having three hidden
units with tanh activation functions and linear output units. The resulting
network functions are shown by the red curves, and the outputs of the
three hidden units are shown by the three dashed curves.

The approximation properties of two-layer feed-forward networks were widely
studied in the 1980s, with various theorems showing that, for a wide range of activation
functions, such networks can approximate any function defined over a continuous
subset of $\mathbb{R}^D$ to arbitrary accuracy (Funahashi, 1989; Cybenko, 1989; Hornik,
Stinchcombe, and White, 1989; Leshno et al., 1993). A similar result holds for functions
from any finite-dimensional discrete space to any other. Neural networks are
therefore said to be universal approximators.

Although such theorems are reassuring, they tell us only that there exists a net- 
work that can represent the required function. In some cases, they may require net- 
works that have an exponentially large number of hidden units. Moreover, they say 
nothing about whether such a network can be found by a learning algorithm. Fur- 
thermore, we will see later that the no free lunch theorem says that we can never find 
a truly universal machine learning algorithm. Finally, although networks having two 
layers of weights are universal approximators, in a practical application, there can 
be huge benefits in considering networks having many more than two layers that can 
learn hierarchical internal representations. All these points support the drive towards 
deep learning. 
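
An experiment in the spirit of Figure 6.10 can be sketched as follows: a two-layer network with three tanh hidden units and a linear output is fitted to $f(x) = x^2$ using 50 points on $(-1, 1)$. Several details are assumptions made for illustration, since the text does not state them: the inputs are equally spaced rather than randomly sampled, training uses plain full-batch gradient descent on a mean squared error, and the learning rate, initialization, and number of steps are arbitrary choices.

```python
import numpy as np


def train_two_layer(x, t, n_hidden=3, lr=0.05, n_steps=2000, seed=0):
    """Fit y(x) = W2 tanh(W1 x + b1) + b2 by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    X, T = x[:, None], t[:, None]           # column vectors of shape (N, 1)
    W1 = rng.normal(0.0, 1.0, size=(n_hidden, 1))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, size=(1, n_hidden))
    b2 = np.zeros(1)
    losses = []
    for _ in range(n_steps):
        Z = np.tanh(X @ W1.T + b1)          # hidden-unit outputs, shape (N, M)
        Y = Z @ W2.T + b2                   # linear output unit, shape (N, 1)
        err = Y - T
        losses.append(0.5 * np.mean(err ** 2))
        dY = err / len(x)                   # gradient of the loss w.r.t. Y
        dA = (dY @ W2) * (1.0 - Z ** 2)     # back through tanh: h'(a) = 1 - tanh(a)^2
        W2 -= lr * (dY.T @ Z)
        b2 -= lr * dY.sum(axis=0)
        W1 -= lr * (dA.T @ X)
        b1 -= lr * dA.sum(axis=0)
    return (W1, b1, W2, b2), losses


x = np.linspace(-1.0, 1.0, 50)
t = x ** 2
params, losses = train_two_layer(x, t)
print(losses[0], losses[-1])
```

The existence theorems say nothing about whether such a procedure finds a good solution, so the final error depends on the initialization and step size chosen here.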


6.2.3 Hidden unit activation functions 


We have seen that the activation functions for the output units are determined
by the kind of distribution being modelled. For the hidden units, however, the only
requirement is that they need to be differentiable, which leaves a wide range of possibilities.
In most cases, all the hidden units in a network will be given the same
activation function, although in principle there is no reason why different choices
could not be applied in different parts of the network.

Figure 6.11 Example of the solution of a simple two-class classification problem
involving synthetic data using a neural network having two inputs, two hidden
units with tanh activation functions, and a single output having a logistic-sigmoid
activation function. The dashed blue lines show the $z = 0.5$ contours for each of
the hidden units, and the red line shows the $y = 0.5$ decision surface for the network.
For comparison, the green lines denote the optimal decision boundary computed
from the distributions used to generate the data.

The simplest option for a hidden unit activation function is the identity function, 
which means that all the hidden units become linear. However, for any such network, 
we can always find an equivalent network without hidden units. This follows from 
the fact that the composition of successive linear transformations is itself a linear 
transformation, and so its representational capability is no greater than that of a sin- 
gle linear layer. However, if the number of hidden units is smaller than either the 
number of input or output units, then the transformations that such a network can 
generate are not the most general possible linear transformation from inputs to out- 
puts because information is lost in the dimensionality reduction at the hidden units. 
Consider a network with N inputs, M hidden units, and K outputs, and where all ac- 
tivation functions are linear. Such a network has M(N + K) parameters, whereas a 
linear transformation of inputs directly to outputs would have N K parameters. If M 
is small relative to N or K, or both, this leads to a two-layer linear network having 
fewer parameters than the direct linear mapping, corresponding to a rank-deficient 
transformation. Such ‘bottleneck’ networks of linear units correspond to a standard
data analysis technique called principal component analysis. In general, however, 
there is limited interest in using multilayer networks of linear units since the overall 
function computed by such a network is still linear. 
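
The collapse of a linear two-layer network into a single linear map, together with the rank restriction imposed by a narrow hidden layer, can be verified numerically. The sizes below are arbitrary, and bias terms are omitted to match the parameter counts $M(N + K)$ and $NK$ quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 2, 4                 # inputs, hidden units, outputs (M is the bottleneck)
W1 = rng.standard_normal((M, N))  # first-layer weights
W2 = rng.standard_normal((K, M))  # second-layer weights
x = rng.standard_normal(N)

W = W2 @ W1                       # the equivalent single-layer weight matrix
two_layer = W2 @ (W1 @ x)
one_layer = W @ x

rank = np.linalg.matrix_rank(W)   # cannot exceed M
n_params_two_layer = M * (N + K)
n_params_direct = N * K
print(rank, n_params_two_layer, n_params_direct)
```

With these sizes the product matrix has rank at most 2, whereas a general $4 \times 5$ matrix can have rank 4, so the bottleneck network cannot represent every linear map from inputs to outputs.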

A simple, nonlinear differentiable function is the logistic sigmoid given by

$$
\sigma(a) = \frac{1}{1 + \exp(-a)} \tag{6.13}
$$

which is plotted in Figure 5.12. This was widely used in the early years of work on
multilayer neural networks and was partly inspired by studies of the properties of
biological neurons. A closely related function is tanh, which is defined by

$$
\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \tag{6.14}
$$

which is plotted in Figure 6.12(a). This function differs from the logistic sigmoid
by a linear transformation of its input and its output values, and so for any network
with logistic-sigmoid hidden-unit activation functions, there is an equivalent network
with tanh activation functions. However, when training a network, these are not
necessarily equivalent because for gradient-based optimization, the network weights
and biases need to be initialized, and so if the activation functions are changed, then
the initialization scheme must be adjusted accordingly. A ‘hard’ version of the tanh
function (Collobert, 2004) is given by

$$
h(a) = \max\left(-1, \min(1, a)\right) \tag{6.15}
$$

and is plotted in Figure 6.12(b).

Figure 6.12 A variety of nonlinear activation functions.
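
The linear relationship between tanh and the logistic sigmoid is $\tanh(a) = 2\sigma(2a) - 1$, which follows directly from the two definitions. The short sketch below checks this identity numerically and also implements the hard tanh; the function names are ours.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def hard_tanh(a):
    """Piecewise-linear 'hard' tanh: clips its input to the interval [-1, 1]."""
    return np.maximum(-1.0, np.minimum(1.0, a))


a = np.linspace(-5.0, 5.0, 101)
max_gap = np.max(np.abs(np.tanh(a) - (2.0 * sigmoid(2.0 * a) - 1.0)))
print(max_gap)
```

The input is rescaled by a factor of 2 and the output is rescaled and shifted, which is why a sigmoid network can always be rewritten as a tanh network with suitably transformed weights and biases.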

A major drawback of both the logistic sigmoid and the tanh activation functions 
is that the gradients go to zero exponentially when the inputs have either large positive
or large negative values. We will discuss this ‘vanishing gradients’ issue later,
but for the moment, we note that it will generally be better to use activation functions
with non-zero gradients, at least when the input takes a large positive value.
One such choice is the softplus activation function given by 


h(a) = ln (1 + exp(a)) , (6.16) 


which is plotted in Figure 6.12(c). For $a \gg 1$, we have $h(a) \simeq a$, and so the gradient
remains non-zero even when the input to the activation function is large and positive, 
thereby helping to alleviate the vanishing gradients problem. 
An even simpler choice of activation function is the rectified linear unit or ReLU, 
which is defined by 
h(a) = max(0, a) (6.17) 


and which is plotted in Figure 6.12(d). Empirically, this is one of the best-performing 
activation functions, and it is in widespread use. Note that strictly speaking, the 
derivative of the ReLU function is not defined when a = 0, but in practice this can 
be safely ignored. The softplus function (6.16) can be viewed as a smoothed version 
of the ReLU and is therefore also sometimes called soft ReLU. 

Although the ReLU has a non-zero gradient for positive input values, this is 
not the case for negative inputs, which can mean that some hidden units receive no 
‘error signal’ during training. A modification of ReLU that seeks to avoid this issue 
is called a leaky ReLU and is defined by 


$$
h(a) = \max(0, a) + \alpha \min(0, a), \tag{6.18}
$$


where $0 < \alpha < 1$. This function is plotted in Figure 6.12(e). Unlike ReLU, this has a
nonzero gradient for input values $a < 0$, which ensures that there is a signal to drive
training. A variant of this activation function uses $\alpha = -1$, in which case $h(a) = |a|$,
which is plotted in Figure 6.12(f). Another variant allows each hidden unit to have its
own value $\alpha_j$, which can be learned during network training by evaluating gradients
with respect to the $\{\alpha_j\}$ along with the gradients with respect to the weights and
biases.
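
The softplus, ReLU, and leaky ReLU functions are one-liners in NumPy. In the sketch below the default value of `alpha` is an arbitrary illustrative choice, and softplus is evaluated with `np.logaddexp` to avoid overflow for large inputs; this is a numerical convenience and does not change the function being computed.

```python
import numpy as np


def softplus(a):
    """ln(1 + exp(a)), evaluated stably as logaddexp(0, a)."""
    return np.logaddexp(0.0, a)


def relu(a):
    """Rectified linear unit max(0, a)."""
    return np.maximum(0.0, a)


def leaky_relu(a, alpha=0.1):
    """max(0, a) + alpha * min(0, a); alpha = 0 recovers ReLU, alpha = -1 gives |a|."""
    return np.maximum(0.0, a) + alpha * np.minimum(0.0, a)


a = np.linspace(-3.0, 3.0, 13)
print(relu(a))
print(leaky_relu(a))
print(softplus(a))
```

Setting `alpha=-1` reproduces the absolute-value activation, and for large positive inputs softplus is numerically indistinguishable from its argument.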

The introduction of ReLU gave a big improvement in training efficiency over 
previous sigmoidal activation functions (Krizhevsky, Sutskever, and Hinton, 2012). 
As well as allowing deeper networks to be trained efficiently, it is much less sensitive 
to the random initialization of the weights. It is also well suited to a low-precision 
implementation, such as 8-bit fixed versus 64-bit floating point, and it is computa- 
tionally cheap to evaluate. Many practical applications simply use ReLU units as 
the default unless the goal is explicitly to explore the effects of different choices of 
activation function. 


6.2.4 Weight-space symmetries 


One property of feed-forward networks is that multiple distinct choices for the 
weight vector w can all give rise to the same mapping function from inputs to outputs 
(Chen, Lu, and Hecht-Nielsen, 1993). Consider a two-layer network of the form 
shown in Figure 6.9 with M hidden units having tanh activation functions and full 
connectivity in both layers. If we change the sign of all the weights and the bias 
feeding into a particular hidden unit, then, for a given input data point, the sign
of the pre-activation of the hidden unit will be reversed, and therefore so too will 
the activation, because tanh is an odd function, so that $\tanh(-a) = -\tanh(a)$.
This transformation can be exactly compensated for by changing the sign of all the 
weights leading out of that hidden unit. Thus, by changing the signs of a particular 
group of weights (and a bias), the input-output mapping function represented by 
the network is unchanged, and so we have found two different weight vectors that 
give rise to the same mapping function. For $M$ hidden units, there will be $M$ such
‘sign-flip’ symmetries, and thus, any given weight vector will be one of a set of $2^M$
equivalent weight vectors.

Similarly, imagine that we interchange the values of all of the weights (and the 
bias) leading both into and out of a particular hidden unit with the corresponding 
values of the weights (and bias) associated with a different hidden unit. Again, this 
clearly leaves the network input-output mapping function unchanged, but it cor- 
responds to a different choice of weight vector. For M hidden units, any given 
weight vector will belong to a set of $M \times (M - 1) \times \cdots \times 2 \times 1 = M!$ equivalent
weight vectors associated with this interchange symmetry, corresponding to the $M!$
different orderings of the hidden units. The network will therefore have an overall
weight-space symmetry factor of $M!\,2^M$. For networks with more than two layers
of weights, the total level of symmetry will be given by the product of such factors, 
one for each layer of hidden units. 
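Both symmetries can be checked numerically. The sketch below builds a small two-layer tanh network with arbitrary sizes and random weights (all assumptions made for illustration), applies a sign flip to one hidden unit and a permutation to all hidden units, and compares the outputs.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2  # input, hidden, and output sizes (arbitrary choices)

W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)
W2 = rng.normal(size=(K, M)); b2 = rng.normal(size=K)

def network(x, W1, b1, W2, b2):
    # Two-layer network with tanh hidden units and linear outputs
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=D)
y = network(x, W1, b1, W2, b2)

# Sign-flip symmetry: negate the weights and bias into hidden unit j,
# and compensate by negating the weights leading out of it.
j = 1
W1_f, b1_f, W2_f = W1.copy(), b1.copy(), W2.copy()
W1_f[j, :] *= -1
b1_f[j] *= -1
W2_f[:, j] *= -1
y_flip = network(x, W1_f, b1_f, W2_f, b2)

# Interchange symmetry: reorder the hidden units consistently in both layers.
perm = rng.permutation(M)
y_perm = network(x, W1[perm], b1[perm], W2[:, perm], b2)

# Overall symmetry factor M! * 2^M for one layer of M hidden units
n_equivalent = math.factorial(M) * 2 ** M
```

For $M = 4$ the factor is $4! \times 2^4 = 384$, so even this tiny network has hundreds of equivalent weight vectors.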

It turns out that these factors account for all the symmetries in weight space 
(except for possible accidental symmetries due to specific choices for the weight 
values). Furthermore, the existence of these symmetries is not a particular property 
of the tanh function but applies to a wide range of activation functions (Kurkova and 
Kainen, 1994). In general, these symmetries in weight space are of little practical
consequence, since network training aims to find a specific setting for the parameters,
and it matters little that other, equivalent settings exist. However,
weight-space symmetries do play a role when Bayesian methods are used to evaluate 
the probability distribution over networks of different sizes (Bishop, 2006). 


6.3 Deep Networks


We have motivated the development of neural networks by making the basis func- 
tions of a linear regression or classification model themselves be governed by learn- 
able parameters, giving rise to the two-layer network model shown in Figure 6.9. For 
many years, this was the most widely used architecture, primarily because it proved 
difficult to train networks with more than two layers effectively. However, extend- 
ing neural networks to have more than two layers, known as deep neural networks, 
brings many advantages as we will discuss shortly, and recent advances in techniques 
for training neural networks are effective for networks with many layers. 

We can easily extend the two-layer network architecture (6.12) to any finite num- 
ber L of layers, in which layer l = 1,..., L computes the following function: 


$$\mathbf{z}^{(l)} = h^{(l)}\left(\mathbf{W}^{(l)} \mathbf{z}^{(l-1)}\right) \qquad (6.19)$$

where $h^{(l)}$ denotes the activation function associated with layer $l$, and $\mathbf{W}^{(l)}$ denotes
the corresponding matrix of weight and bias parameters. Also, $\mathbf{z}^{(0)} = \mathbf{x}$ represents
the input vector and $\mathbf{z}^{(L)} = \mathbf{y}$ represents the output vector.
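The layer recursion (6.19) maps onto a short loop. The sketch below is one possible reading, assuming that the bias is absorbed into each weight matrix by appending a constant 1 to the layer input; the layer sizes, the `forward` helper, and the choice of tanh hidden units with identity outputs are illustrative assumptions.

```python
import numpy as np

def forward(x, weights, activations):
    # Apply z^(l) = h^(l)(W^(l) z^(l-1)) for l = 1, ..., L, starting from z^(0) = x.
    z = x
    for W, h in zip(weights, activations):
        z_aug = np.append(z, 1.0)  # assumed convention: last column of W holds the bias
        z = h(W @ z_aug)
    return z

def identity(a):
    return a

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # input dimension followed by the width of each layer (L = 3)
weights = [rng.normal(size=(n_out, n_in + 1))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
activations = [np.tanh, np.tanh, identity]

x = rng.normal(size=sizes[0])
y = forward(x, weights, activations)  # z^(L), the network output
```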

Note that there has been some confusion in the literature regarding the termi- 
nology for counting the number of layers in such networks. Thus, the network in 
Figure 6.9 is sometimes described as a three-layer network (which counts the num- 
ber of layers of units and treats the inputs as units) or sometimes as a single-hidden- 
layer network (which counts the number of layers of hidden units). We recommend 
a terminology in which Figure 6.9 is called a two-layer network, because it is the 
number of layers of learnable weights that is important for determining the network 
properties. 

We have seen that a network of the form shown in Figure 6.9, having two layers 
of learnable parameters, has universal approximation capabilities. However, net- 
works with more than two layers can sometimes represent a given function with far 
fewer parameters than a two-layer network. Montúfar et al. (2014) show that the 
network function divides the input space into a number of regions that is exponential 
in the depth of the network, but which is only polynomial in the width of the hidden 
layers. To represent the same function using a two-layer network would require an 
exponential number of hidden units. 


6.3.1 Hierarchical representations 


Although this is an interesting result, a more compelling reason to explore deep 
neural networks is that the network architecture encodes a particular form of induc- 
tive bias, namely that the outputs are related to the input space through a hierarchical 
representation. A good example is the task of recognizing objects in images. The 
relationship between the pixels of an image and a high-level concept such as ‘cat’ is 
highly complex and nonlinear, and would be an extremely challenging problem for 
a two-layer network. However, a deep neural network can learn to detect low-level 
features, such as edges, in the early layers, and can then combine these in subse- 
quent layers to make higher-level features such as eyes or whiskers, which in turn 
can be combined in later layers to detect the presence of a cat. This can be viewed 
as a compositional inductive bias, in which higher-level objects, such as a cat, are 
composed of lower-level objects, such as eyes, which in turn have yet lower-level el- 
ements such as edges. We can also think of this in reverse by considering the process 
of generating an image starting with low-level features such as edges, then combin- 
ing these to form simple shapes such as circles, and then combining those in turn to 
form higher-level objects such as cats. At each stage there are many ways to com- 
bine different components, giving an exponential gain in the number of possibilities 
with increasing depth. 


6.3.2 Distributed representations 


Neural networks can take advantage of another form of compositionality called 
a distributed representation. Conceptually, each unit in a hidden layer can be thought 
of as representing a ‘feature’ at that level of the network, with a high value of the 
activation indicating that the corresponding feature is present and a low value indicating
its absence. With $M$ units in a given layer, such a network can represent $M$
different features. However, the network could potentially learn a different representation
in which combinations of hidden units represent features, thereby potentially
allowing a hidden layer with $M$ units to represent $2^M$ different features, growing
exponentially with the number of units. Consider, for example, a network designed 
to process images of faces. Each particular face image may or may not have glasses, 
it may or may not have a hat, and it may or may not have a beard, leading to eight 
different combinations. Although this could be represented by eight units each of 
which ‘turns on’ when it detects the corresponding combination, it could also be 
represented more compactly by just three units, one for each attribute. These can be 
present independently of each other (although statistically their presence is likely to 
be correlated to some degree). Later, we will explore in detail the kinds of internal 
representations that deep learning networks discover for themselves during training. 
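The counting argument for the face example can be made concrete. This small sketch enumerates the binary patterns available to three attribute units and compares the number of units needed by the two encodings; the variable names are illustrative.

```python
import itertools

attributes = ["glasses", "hat", "beard"]

# Distributed code: one binary unit per attribute, so every on/off pattern is a state.
codes = list(itertools.product([0, 1], repeat=len(attributes)))

units_one_per_combination = len(codes)   # one unit for each combination
units_distributed = len(attributes)      # one unit for each attribute
```

Adding a fourth attribute would double the number of combinations while adding only one unit to the distributed code.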


6.3.3 Representation learning 


We can view the successive layers of a deep neural network as performing transformations
of the data that make it easier to solve the desired task or tasks. For
example, a neural network that successfully learns to classify skin lesions as benign 
or malignant must have learned to transform the original image data into a new space, 
represented by the outputs of the final layer of hidden units, such that the final layer 
of the network can distinguish the two classes. This final layer can be viewed as a 
simple linear classifier, and so in the representation of the last hidden layer, the two 
classes must be well separated by a linear surface. This ability to discover a nonlin- 
ear transformation of the data that makes subsequent tasks easier to solve is called 
representation learning (Bengio, Courville, and Vincent, 2012). The learned repre- 
sentation, sometimes called the embedding space, is given by the outputs of one of 
the hidden layers of the network, so that any input vector, either from the training set 
or from some new data set, can be transformed into this representation by forward 
propagation through the network. 

Representation learning is especially powerful because it allows us to exploit 
unlabelled data. Often it is easy to collect a large quantity of unlabelled data, but 
acquiring the associated labels may be more difficult. For example, a video camera 
on a vehicle can gather large numbers of images of urban scenes as the vehicle is 
driven around a city, but taking those images and identifying relevant objects, such 
as pedestrians and road signs, would require expensive and time-consuming human 
labelling. 

Learning from unlabelled data is called unsupervised learning, and many differ- 
ent algorithms have been developed to do this. For example, a neural network can be 
trained to take images as input and to create the same images as the output. To make 
this into a non-trivial task, the network may use hidden layers with fewer units than 
the number of pixels in the image, thereby forcing the network to learn some kind of 
compression of the images. Only unlabelled data is needed because each image in 
the training set acts as both the input vector and the target vector. Such networks are 
known as autoencoders. The goal is that this type of training will force the network 
to discover some internal representation for the data that is useful for solving other
tasks, such as image classification. 
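A minimal sketch of the autoencoder idea follows. It is deliberately simplified: the data are synthetic vectors rather than images, both layers are linear, and the sizes, learning rate, and iteration count are assumptions chosen only to keep the example short. The key point it illustrates is that the input serves as its own target and that the hidden layer is narrower than the input.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 200, 8, 2  # data points, input dimension, bottleneck width (M < D)

# Synthetic unlabelled data lying close to a 2-dimensional subspace
latent = rng.normal(size=(N, M))
mixing = rng.normal(size=(M, D))
X = latent @ mixing + 0.01 * rng.normal(size=(N, D))

W_enc = 0.1 * rng.normal(size=(D, M))  # encoder: input -> hidden representation
W_dec = 0.1 * rng.normal(size=(M, D))  # decoder: hidden representation -> reconstruction

def reconstruction_error(X, W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec
    return 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))

initial_error = reconstruction_error(X, W_enc, W_dec)

learning_rate = 0.01
for _ in range(2000):
    Z = X @ W_enc                         # hidden representation (the embedding)
    R = Z @ W_dec - X                     # residuals; the target is the input itself
    grad_dec = Z.T @ R / N
    grad_enc = X.T @ (R @ W_dec.T) / N
    W_enc -= learning_rate * grad_enc
    W_dec -= learning_rate * grad_dec

final_error = reconstruction_error(X, W_enc, W_dec)
embedding = X @ W_enc                     # representation usable for other tasks
```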

Historically, unsupervised learning played an important role in enabling the first 
deep networks (apart from convolutional networks) to be successfully trained. Each 
layer of the network was first pre-trained using unsupervised learning and then the 
entire network was trained further using gradient-based supervised training. It was 
later discovered that the pre-training phase could be omitted and a deep network 
could be trained from scratch purely using supervised learning given appropriate 
conditions. 

However, pre-training and representation learning remain central to deep learn- 
ing in other contexts. The most notable example of pre-training is in natural language 
processing in which transformer models are trained on large quantities of text and are 
able to learn highly sophisticated internal representations of language that facilitate
an impressive range of capabilities at human level and beyond. 


6.3.4 Transfer learning 


The internal representation learned for one particular task might also be useful 
for related tasks. For example, a network trained on a large labelled data set of 
everyday objects can learn how to transform an image representation into one that is 
much better suited for classifying objects. Then, the final classification layer of the 
network can be retrained using a smaller labelled data set of skin lesion images to 
create a lesion classifier. This is an example of transfer learning (Hospedales et al., 
2021), which allows higher accuracy to be achieved than if only the lesion image 
data were used for training, because the network can exploit commonalities shared 
by natural images in general. Transfer learning is illustrated in Figure 6.13. 

In general, transfer learning can be used to improve performance on some task 
A, for which training data is in short supply, by using data from a related task B, for 
which data is more plentiful. The two tasks should have the same kind of inputs, 
and there should be some commonality between the tasks so that low-level features, 
or internal representations, learned from task B will be useful for task A. When we 
look at convolutional networks we will see that many image processing tasks require 
similar low-level features corresponding to the early layers of a deep neural network, 
whereas later layers are more specialized to a particular task, making such networks 
well suited to transfer learning applications. 

When data for task A is very scarce, we might simply retrain the final layer of
the network. In contrast, if there are more data points, it is feasible to retrain several 
layers. The process of learning parameters using one task that are then applied to 
one or more other tasks is called pre-training. Note that for the new task, instead 
of applying stochastic gradient descent to the whole network, it is much more ef- 
ficient to send the new training data once through the fixed pre-trained network so 
as to evaluate the training inputs in the new representation. Iterative gradient-based 
optimization can then be applied just to the smaller network consisting of the final 
layers. As well as using a pre-trained network as a fixed pre-processor for a different 
task, it is also possible to apply fine-tuning in which the whole network is adapted to 
the data for task A. This is generally done with a very small learning rate for a limited
number of iterations to ensure that the network does not over-fit to the relatively
small data set available for the new task.

Figure 6.13 Schematic illustration of transfer learning. (a) A network is first trained on a task with abundant
data, such as object classification of natural images. (b) The early layers of the network (shown in red) are
copied from the first task and the final few layers of the network (shown in blue) are then retrained on a new task
such as skin lesion classification for which training data is more scarce.
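The efficiency argument for a fixed pre-trained network can be sketched as follows. Everything here is a stand-in: `W_pre` plays the role of early layers that would in practice come from training on task B, the task A data are synthetic, and the new final layer is a logistic-regression unit trained by plain gradient descent with an assumed learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen pre-trained layers (hypothetical weights, not a real model)
W_pre = rng.normal(size=(16, 10))

def pretrained_features(X):
    return np.tanh(X @ W_pre.T)

# Small labelled data set for task A (synthetic)
N = 40
X_A = rng.normal(size=(N, 10))
t_A = (X_A[:, 0] > 0).astype(float)

# Send the data through the fixed network once and cache the new representation
Phi = pretrained_features(X_A)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, b):
    a = Phi @ w + b
    # -{t ln y + (1 - t) ln(1 - y)} written in terms of the pre-activation a
    return np.mean(t_A * np.logaddexp(0.0, -a) + (1.0 - t_A) * np.logaddexp(0.0, a))

# Train only the new final layer on the cached features
w = np.zeros(Phi.shape[1])
b = 0.0
initial_error = cross_entropy(w, b)
for _ in range(500):
    y = sigmoid(Phi @ w + b)
    w -= 0.1 * Phi.T @ (y - t_A) / N
    b -= 0.1 * np.mean(y - t_A)
final_error = cross_entropy(w, b)
```

Because `Phi` is computed once, each training iteration touches only the final-layer parameters, which is the saving described above. Fine-tuning would instead update `W_pre` as well, typically with a much smaller learning rate.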

A related approach is multitask learning (Caruana, 1997) in which a network 
jointly learns more than one related task at the same time. For example, we might 
wish to construct a spam email filter that allows different users to have different 
classifiers tuned to their particular preferences. The training data may comprise ex- 
amples of spam email and non-spam email for many different users, but the number 
of examples for any one user may be quite limited, and therefore training a separate 
classifier for each user would give poor results. Instead, we can combine the data 
sets to train a single larger network that might, for example, share early layers but 
have separate learnable parameters for the different users in later layers. Sharing 
data across tasks allows the network to exploit commonalities amongst the tasks, 
thereby improving the accuracy for all users. With a large number of training exam- 
ples, a deeper network with more parameters can be used, again leading to improved 
performance. 

Learning across multiple tasks can be extended to meta-learning, which is also 
called learning to learn. Whereas multitask learning aims to make predictions for 
a fixed set of tasks, the aim of meta-learning is to make predictions for future tasks 
that were not seen during training. This can be done by not only learning a shared 
internal representation across tasks but also by learning the learning algorithm itself
(Hospedales et al., 2021). Meta-learning can be used to facilitate generalization of, 
for example, a classification model to new classes when there are very few labelled 
examples of the new classes. This is referred to as few-shot learning. When only a 
single labelled example is used it is called one-shot learning. 


6.3.5 Contrastive learning 


One of the most common and powerful representation learning methods is con- 
trastive learning (Gutmann and Hyvärinen, 2010; Oord, Li, and Vinyals, 2018; 
Chen, Kornblith, et al., 2020). The idea is to learn a representation such that cer- 
tain pairs of inputs, referred to as positive pairs, are close in the embedding space, 
and other pairs of inputs, called negative pairs, are far apart. The intuition is that 
if we choose our positive pairs in such a way that they are semantically similar and 
choose negative pairs that are semantically dissimilar, then we will learn a represen- 
tation space in which similar inputs are close, making downstream tasks, such as 
classification, much easier. As with other forms of representation learning, the out- 
puts of the trained network are typically not used directly, and instead the activations 
at some earlier layer are used to form the embedding space. Contrastive learning is 
unlike most other machine learning tasks, in that the error function for a given input 
is defined only with respect to other inputs, instead of having a per-input label or 
target output. 

Suppose we have a given data point $\mathbf{x}$ called the anchor, for which we have specified
another data point $\mathbf{x}^+$ that together with $\mathbf{x}$ makes up a positive pair. We must
also specify a set of data points $\{\mathbf{x}_1^-, \ldots, \mathbf{x}_N^-\}$ each of which makes up a negative
pair with $\mathbf{x}$. We now need a loss function that will reward close proximity between
the representations of $\mathbf{x}$ and $\mathbf{x}^+$ while encouraging a large distance between each pair
$\{\mathbf{x}, \mathbf{x}_n^-\}$. One example of such a function, and the most commonly used loss function
for contrastive learning, is called the InfoNCE loss (Gutmann and Hyvärinen,
2010; Oord, Li, and Vinyals, 2018), where NCE denotes ‘noise contrastive estimation’.
Suppose we have a neural network function $\mathbf{f}_{\mathbf{w}}(\mathbf{x})$ that maps points from the
input space $\mathbf{x}$ to a representation space, governed by learnable parameters $\mathbf{w}$. This
representation is normalized so that $\|\mathbf{f}_{\mathbf{w}}(\mathbf{x})\| = 1$. Then, for a data point $\mathbf{x}$, the
InfoNCE loss is defined by

$$E(\mathbf{w}) = -\ln \frac{\exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x})^{\mathrm{T}} \mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)\}}{\exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x})^{\mathrm{T}} \mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)\} + \sum_{n=1}^{N} \exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x})^{\mathrm{T}} \mathbf{f}_{\mathbf{w}}(\mathbf{x}_n^-)\}}. \qquad (6.20)$$

We can see that in this function, the cosine similarity $\mathbf{f}_{\mathbf{w}}(\mathbf{x})^{\mathrm{T}} \mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)$ between the
representation $\mathbf{f}_{\mathbf{w}}(\mathbf{x})$ of the anchor and the representation $\mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)$ of the positive
example provides our measure of how close the positive pair examples are in the
learned space, and the same measure is used to assess how close the anchor is to the
negative examples. Note that the function resembles a classification cross-entropy
error function in which the cosine similarity of the positive pair gives the logit for
the label class and the cosine similarities for the negative pairs give the logits for the
incorrect classes. Also note that the negative pairs are crucial as without them the
embedding would simply learn the degenerate solution of mapping every point to the
same representation.

Section 9.1.3
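The InfoNCE loss (6.20) can be evaluated as a cross-entropy over one positive logit and $N$ negative logits. The sketch below works directly with already-computed unit-norm representations (random vectors standing in for network outputs); the helper names are illustrative assumptions.

```python
import numpy as np

def normalize(v):
    # Project onto the unit hypersphere so that dot products are cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def info_nce(anchor, positive, negatives):
    # anchor, positive: unit vectors of shape (d,); negatives: shape (N, d)
    pos_logit = anchor @ positive
    neg_logits = negatives @ anchor
    logits = np.concatenate(([pos_logit], neg_logits))
    m = logits.max()  # log-sum-exp shift for numerical stability
    log_denominator = m + np.log(np.sum(np.exp(logits - m)))
    return -(pos_logit - log_denominator)

rng = np.random.default_rng(0)
d, N = 8, 5
anchor = normalize(rng.normal(size=d))
positive = normalize(anchor + 0.1 * rng.normal(size=d))
negatives = normalize(rng.normal(size=(N, d)))

loss = info_nce(anchor, positive, negatives)
loss_aligned = info_nce(anchor, anchor, negatives)    # positive coincides with anchor
loss_opposed = info_nce(anchor, -anchor, negatives)   # positive opposite to anchor

# With no negatives the loss is zero for any positive, so nothing
# discourages mapping every input to the same point.
loss_no_negatives = info_nce(anchor, positive, np.empty((0, d)))
```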

A particular contrastive learning algorithm is defined predominantly by how the 
positive and negative pairs are chosen, which is how we use our prior knowledge to 
specify what a good representation should be. For example, consider the problem of 
learning representations of images. Here, a common choice is to create positive pairs 
by corrupting the input images in ways that should preserve the semantic information 
of the image while greatly altering the image in the pixel space (Wu et al., 2018; He 
et al., 2019; Chen, Kornblith, et al., 2020). Corruptions are closely related to data 
augmentations, and examples include rotation, translation, and colour shifts. Other 
images from the data set can then be used to create the negative pairs. This approach 
to contrastive learning is known as instance discrimination. 

If, however, we have access to class labels, then we can use images of the same 
class as positive pairs and images of different classes as negative pairs. This re- 
laxes the reliance on specifying the augmentations that the representation should be 
invariant to and also avoids treating two semantically similar images as a negative 
pair. This is referred to as supervised contrastive learning (Khosla et al., 2020) be- 
cause of the reliance on the class labels, and it can often yield better results than 
simply learning the representation using cross-entropy classification. 

The members of positive and negative pairs do not necessarily have to come 
from the same data modality. In contrastive language-image pretraining, or CLIP
(Radford et al., 2021), a positive pair consists of an image and its corresponding 
text caption, and two separate functions, one for each modality, are used to map the 
inputs to the same representation space. Negative pairs are then mismatched images 
and captions. This is often referred to as weakly supervised, as it relies on captioned 
images, which are often easier to obtain by scraping data from the internet than by 
manually labelling images with their classes. The loss function in this case is given 
by 


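$$\begin{aligned}
E(\mathbf{w}) = {} & -\frac{1}{2} \ln \frac{\exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)^{\mathrm{T}} \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}^+)\}}{\exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)^{\mathrm{T}} \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}^+)\} + \sum_{n=1}^{N} \exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x}_n^-)^{\mathrm{T}} \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}^+)\}} \\
& -\frac{1}{2} \ln \frac{\exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)^{\mathrm{T}} \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}^+)\}}{\exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)^{\mathrm{T}} \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}^+)\} + \sum_{m=1}^{M} \exp\{\mathbf{f}_{\mathbf{w}}(\mathbf{x}^+)^{\mathrm{T}} \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}_m^-)\}} \qquad (6.21)
\end{aligned}$$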


where $\mathbf{x}^+$ and $\mathbf{y}^+$ represent a positive pair in which $\mathbf{x}$ is an image and $\mathbf{y}$ is its corresponding
text caption, $\mathbf{f}_{\mathbf{w}}$ represents the mapping from images to the representation
space, and $\mathbf{g}_{\boldsymbol{\theta}}$ is the mapping from text input to the representation space. We also
require a set $\{\mathbf{x}_1^-, \ldots, \mathbf{x}_N^-\}$ of other images from the data set, for which we can
assume the text caption $\mathbf{y}^+$ is inappropriate, and a set $\{\mathbf{y}_1^-, \ldots, \mathbf{y}_M^-\}$ of text captions
that are similarly mismatched to the input image $\mathbf{x}^+$. The two terms in the loss
function ensure that (a) the representation of the image is close to its text caption
representation relative to other image representations and (b) the text caption representation
is close to the representation of the image it describes relative to other
representations of text captions. Although CLIP uses text and image pairs, any data
set with paired modalities can be used to learn representations. A comparison of the
different contrastive learning methods we have discussed is shown in Figure 6.14.

Figure 6.14 Illustration of three different contrastive learning paradigms. (a) The instance discrimination approach,
where the positive pair is made up of the anchor and an augmented version of the same image. These
are mapped to points in a normalized space that can be thought of as a unit hypersphere. The coloured arrows
show that the loss encourages the representations of the positive pair to be closer together but pushes negative
pairs further apart. (b) Supervised contrastive learning in which the positive pair consists of two different images
from the same class. (c) The CLIP model in which the positive pair is made up of an image and an associated
text snippet.
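The two-term loss (6.21) can be sketched in the same way as the InfoNCE loss, with one softmax taken over images for a fixed caption and one taken over captions for a fixed image. This follows the equation as written here and is not a description of any released CLIP implementation; the random unit vectors stand in for the outputs of the image and text mappings, and all names are illustrative.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def log_sum_exp(logits):
    m = np.max(logits)
    return m + np.log(np.sum(np.exp(logits - m)))

def clip_loss(img_pos, txt_pos, img_negs, txt_negs):
    # img_pos, txt_pos: unit vectors (d,); img_negs: (N, d); txt_negs: (M, d)
    s_pos = img_pos @ txt_pos
    # First term: the positive caption scored against the matching and other images
    logits_over_images = np.concatenate(([s_pos], img_negs @ txt_pos))
    # Second term: the positive image scored against the matching and other captions
    logits_over_texts = np.concatenate(([s_pos], txt_negs @ img_pos))
    term_images = -(s_pos - log_sum_exp(logits_over_images))
    term_texts = -(s_pos - log_sum_exp(logits_over_texts))
    return 0.5 * term_images + 0.5 * term_texts

rng = np.random.default_rng(0)
d, N, M = 8, 4, 6
img_pos = normalize(rng.normal(size=d))                  # stands in for f_w(x+)
txt_pos = normalize(img_pos + 0.1 * rng.normal(size=d))  # stands in for g_theta(y+)
img_negs = normalize(rng.normal(size=(N, d)))            # other images
txt_negs = normalize(rng.normal(size=(M, d)))            # mismatched captions

loss = clip_loss(img_pos, txt_pos, img_negs, txt_negs)
```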


6.3.6 General network architectures 


So far, we have explored neural network architectures that are organized into a 
sequence of fully-connected layers. However, because there is a direct correspon- 
dence between a network diagram and its mathematical function, we can develop 
more general network mappings by considering more complex network diagrams. 
These must be restricted to a feed-forward architecture, in other words to one having 
no closed directed cycles, to ensure that the outputs are deterministic functions of 
the inputs. This is illustrated with a simple example in Figure 6.15. Each (hidden or 
output) unit in such a network computes a function given by 


$$z_k = h\left(\sum_{j \in \mathcal{A}(k)} w_{kj} z_j + b_k\right) \qquad (6.22)$$


where $\mathcal{A}(k)$ denotes the set of ancestors of node $k$, in other words the set of units
that send connections to unit $k$, and $b_k$ denotes the associated bias parameter. For
a given set of values applied to the inputs of the network, successive application of 
(6.22) allows the activations of all units in the network to be evaluated including 
those of the output units. 
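Successive application of (6.22) amounts to visiting the units in an order in which every unit comes after all the units that feed into it. The sketch below uses a small hypothetical graph with a skip connection; the unit names, weights, and the use of tanh for every unit (including the output) are assumptions made for illustration.

```python
import numpy as np

def evaluate(inputs, parents, weights, biases, order, h=np.tanh):
    # inputs: activations of the input units; order: a topological ordering of
    # the remaining units, which exists because the graph has no directed cycles.
    z = dict(inputs)
    for k in order:
        a = sum(weights[(k, j)] * z[j] for j in parents[k]) + biases[k]
        z[k] = h(a)  # equation (6.22)
    return z

# parents[k] lists the units that send connections to unit k
parents = {"z1": ["x1", "x2"], "z2": ["x2", "z1"], "y": ["x1", "z1", "z2"]}
weights = {("z1", "x1"): 0.5, ("z1", "x2"): -0.3,
           ("z2", "x2"): 0.8, ("z2", "z1"): 1.2,
           ("y", "x1"): 0.1, ("y", "z1"): -0.7, ("y", "z2"): 0.4}
biases = {"z1": 0.0, "z2": 0.1, "y": -0.2}
order = ["z1", "z2", "y"]

z = evaluate({"x1": 1.0, "x2": 2.0}, parents, weights, biases, order)
```

If the graph contained a directed cycle, no such ordering would exist and the loop could not assign each activation from already-known values.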


6.3.7 Tensors


We see that linear algebra plays a central role in neural networks, with quantities 
such as data sets, activations, and network parameters represented as scalars, vectors, 
and matrices. However, we also encounter variables of higher dimensionality. Consider,
for example, a data set of $N$ colour images each of which is $I$ pixels high and
$J$ pixels wide. Each pixel is indexed by its row and column within the image and
has red, green, and blue values. We have one such value for each image in the data
set, and so we can represent a particular intensity value by a four-dimensional array
$\mathbf{X}$ with elements $x_{ijkn}$, where $i \in \{1, \ldots, I\}$ and $j \in \{1, \ldots, J\}$ index the row
and column within the image, $k \in \{1, 2, 3\}$ indexes the red, green, and blue intensities,
and $n \in \{1, \ldots, N\}$ indexes the particular image within the data set. These
higher-dimensional arrays are called tensors and include scalars, vectors, and matri- 
ces as special cases. We will see many examples of such tensors when we discuss 
more sophisticated neural network architectures later in the book. Massively parallel 
processors such as GPUs are especially well suited to processing tensors. 
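In array libraries a tensor is simply a multidimensional array. The sketch below builds the four-dimensional image array described above with NumPy, using the axis order (row, column, colour, image) from the text; the sizes are arbitrary, and note that NumPy indices start at 0 whereas the text counts from 1.

```python
import numpy as np

I, J, N = 32, 48, 10          # image height, image width, number of images (arbitrary)
X = np.zeros((I, J, 3, N))    # elements x_ijkn

# Red intensity (k = 1) at row 5, column 8 of image 3, in 0-based indexing
X[4, 7, 0, 2] = 0.9

# Scalars, vectors, and matrices are the special cases with 0, 1, and 2 axes
scalar = np.array(3.0)
vector = np.zeros(5)
matrix = np.zeros((5, 4))
n_axes = [scalar.ndim, vector.ndim, matrix.ndim, X.ndim]
```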


6.4 Error Functions


In earlier chapters, we explored linear models for regression and classification, and 
in the process we derived suitable forms for the error functions along with corre- 
sponding choices for the output-unit activation function. The same considerations 
for choosing an error function apply for multilayer neural networks, and so for con- 
venience, we will summarize the key points here. 


6.4.1 Regression 


We start by discussing regression problems, and for the moment we consider
a single target variable $t$ that can take any real value. Following the discussion of
regression in single-layer networks, we assume that $t$ has a Gaussian distribution
with an $\mathbf{x}$-dependent mean, which is given by the output of the neural network, so
that

$$p(t | \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(t \,|\, y(\mathbf{x}, \mathbf{w}), \sigma^2\right) \qquad (6.23)$$
where $\sigma^2$ is the variance of the Gaussian noise. Of course this is a somewhat restrictive
assumption, and in some applications we will need to extend this approach to
allow for more general distributions. For the conditional distribution given by (6.23),
it is sufficient to take the output-unit activation function to be the identity, because
such a network can approximate any continuous function from $\mathbf{x}$ to $y$. Given a data
set of $N$ i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, along with corresponding target
values $\mathbf{t} = \{t_1, \ldots, t_N\}$, we can construct the corresponding likelihood function:

$$p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} p(t_n | y(\mathbf{x}_n, \mathbf{w}), \sigma^2). \qquad (6.24)$$

Note that in the machine learning literature, it is usual to consider the minimization 
of an error function rather than the maximization of the likelihood, and so here we 
will follow this convention. Taking the negative logarithm of the likelihood function 
(6.24), we obtain the error function 


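$$\frac{1}{2\sigma^2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2} \ln \sigma^2 + \frac{N}{2} \ln(2\pi), \qquad (6.25)$$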


which can be used to learn the parameters $\mathbf{w}$ and $\sigma^2$. Consider first the determination
of w. Maximizing the likelihood function is equivalent to minimizing the sum-of- 
squares error function given by 


$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 \qquad (6.26)$$


where we have discarded additive and multiplicative constants. The value of w 
found by minimizing E(w) will be denoted w*. Note that this will typically not 
correspond to the global maximum of the likelihood function because the nonlinearity
of the network function $y(\mathbf{x}_n, \mathbf{w})$ causes the error $E(\mathbf{w})$ to be non-convex,
and so finding the global optimum is generally infeasible. Moreover, regularization 
terms may be added to the error function and other modifications may be made to 
the training process, so that the resulting solution for the network parameters may 
differ significantly from the maximum likelihood solution. 

Having found $\mathbf{w}^\star$, the value of $\sigma^2$ can be found by minimizing the error function
(6.25) to give

$$\sigma^{2\star} = \frac{1}{N} \sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}^\star) - t_n\}^2. \qquad (6.27)$$


Note that this can be evaluated once the iterative optimization required to find w* is 
completed. 

If we have multiple target variables, and we assume that they are independent, 
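conditional on $\mathbf{x}$ and $\mathbf{w}$, with shared noise variance $\sigma^2$, then the conditional distribution
of the target values is given by


$$p(\mathbf{t} | \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(\mathbf{t} \,|\, \mathbf{y}(\mathbf{x}, \mathbf{w}), \sigma^2 \mathbf{I}\right). \qquad (6.28)$$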
Following the same argument as for a single target variable, we see that maximizing
the likelihood function with respect to the weights is equivalent to minimizing the
sum-of-squares error function:


$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \|\mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n\|^2. \qquad (6.29)$$


The noise variance is then given by


$$\sigma^{2\star} = \frac{1}{NK} \sum_{n=1}^{N} \|\mathbf{y}(\mathbf{x}_n, \mathbf{w}^\star) - \mathbf{t}_n\|^2 \qquad (6.30)$$


where K is the dimensionality of the target variable. The assumption of conditional 
independence of the target variables can be dropped at the expense of a slightly more 
complex optimization problem. 

Recall that there is a natural pairing of the error function (given by the negative 
log likelihood) and the output-unit activation function. In regression, we can view the 
network as having an output activation function that is the identity, so that $y_k = a_k$.
The corresponding sum-of-squares error function then has the property


$$\frac{\partial E}{\partial a_k} = y_k - t_k. \qquad (6.31)$$
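The regression quantities in this section are short to compute. The following sketch evaluates the sum-of-squares error (6.29), the noise-variance estimate (6.30), and checks the property (6.31) by finite differences. Random arrays stand in for the network pre-activations and targets, and the function names are illustrative assumptions.

```python
import numpy as np

def sum_of_squares(y, t):
    # (6.26) for shape (N,) and (6.29) for shape (N, K)
    return 0.5 * np.sum((y - t) ** 2)

def noise_variance(y, t):
    # (6.27) for shape (N,) and (6.30) for shape (N, K): the mean squared
    # residual, where y should be the outputs at the trained weights
    return np.mean((y - t) ** 2)

rng = np.random.default_rng(0)
N, K = 6, 3
a = rng.normal(size=(N, K))   # output pre-activations; identity activation, so y = a
t = rng.normal(size=(N, K))   # targets

grad_analytic = a - t         # (6.31): dE/da_k = y_k - t_k

# Central finite differences as an independent check of (6.31)
eps = 1e-6
grad_numeric = np.zeros_like(a)
for idx in np.ndindex(a.shape):
    a_plus = a.copy(); a_plus[idx] += eps
    a_minus = a.copy(); a_minus[idx] -= eps
    grad_numeric[idx] = (sum_of_squares(a_plus, t) - sum_of_squares(a_minus, t)) / (2 * eps)

sigma2 = noise_variance(a, t)
```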


6.4.2 Binary classification 


Now consider binary classification in which we have a single target variable 
$t$ such that $t = 1$ denotes class $\mathcal{C}_1$ and $t = 0$ denotes class $\mathcal{C}_2$. Following the
discussion of canonical link functions, we consider a network having a single output
whose activation function is a logistic sigmoid (6.13) so that $0 < y(\mathbf{x}, \mathbf{w}) < 1$. We
can interpret $y(\mathbf{x}, \mathbf{w})$ as the conditional probability $p(\mathcal{C}_1 | \mathbf{x})$, with $p(\mathcal{C}_2 | \mathbf{x})$ given by
$1 - y(\mathbf{x}, \mathbf{w})$. The conditional distribution of targets given inputs is then a Bernoulli
distribution of the form


$$p(t | \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{1 - y(\mathbf{x}, \mathbf{w})\}^{1 - t}. \qquad (6.32)$$


If we consider a training set of independent observations, then the error function, 
which is given by the negative log likelihood, is then a cross-entropy error of the 


form 
N 


E(w) =— X {ta lnyn + (1 — tn) n(1 — yn)} (6.33) 
n=1 

where $y_n$ denotes $y(\mathbf{x}_n, \mathbf{w})$. Simard, Steinkraus, and Platt (2003) found that using
the cross-entropy error function instead of the sum-of-squares for a classification
problem leads to faster training as well as improved generalization.

Note that there is no analogue of the noise variance $\sigma^2$ in (6.32) because the
target values are assumed to be correctly labelled. However, the model is easily
extended to allow for labelling errors by introducing a probability $\epsilon$ that the target
value $t$ has been flipped to the wrong value (Opper and Winther, 2000). Here $\epsilon$ may
be set in advance, or it may be treated as a hyperparameter whose value is inferred
from the data.
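As an illustration, the sketch below evaluates the cross-entropy error (6.33) directly from the output pre-activations and checks numerically that its derivative has the form (6.31). The data values and function names are hypothetical. The function `noisy_label_cross_entropy` shows one possible way of writing the label-noise extension, under the assumption that each target is flipped independently with probability `eps`; it is not a formulation given in the text.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, evaluated without overflow for large |a|."""
    e = np.exp(-np.abs(a))
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def binary_cross_entropy(a, t):
    """Error (6.33) from pre-activations a, where y = sigmoid(a)."""
    # -ln y = ln(1 + exp(-a)) and -ln(1 - y) = ln(1 + exp(a)).
    return np.sum(t * np.logaddexp(0.0, -a) + (1.0 - t) * np.logaddexp(0.0, a))

def noisy_label_cross_entropy(a, t, eps):
    """Assumed label-noise variant: each target is flipped with probability eps."""
    y = sigmoid(a)
    p1 = (1.0 - eps) * y + eps * (1.0 - y)   # probability that the observed label is 1
    return -np.sum(t * np.log(p1) + (1.0 - t) * np.log(1.0 - p1))

a = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # hypothetical pre-activations
t = np.array([0.0, 1.0, 1.0, 0.0, 1.0])     # hypothetical binary targets

grad_analytic = sigmoid(a) - t               # the form (6.31)

# Central finite differences, one pre-activation at a time.
h = 1e-6
grad_numeric = np.array([
    (binary_cross_entropy(a + h * np.eye(5)[i], t)
     - binary_cross_entropy(a - h * np.eye(5)[i], t)) / (2 * h)
    for i in range(5)
])
```

Working with pre-activations rather than with $y_n$ avoids taking the logarithm of a value that has been rounded to 0 or 1.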

If we have K separate binary classifications to perform, then we can use a net- 
work having K outputs each of which has a logistic-sigmoid activation function. 
Associated with each output is a binary class label $t_k \in \{0, 1\}$, where $k = 1, \ldots, K$.
If we assume that the class labels are independent, given the input vector, then the 
conditional distribution of the targets is 


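$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} \left[1 - y_k(\mathbf{x}, \mathbf{w})\right]^{1 - t_k}. \tag{6.34}$$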


Taking the negative logarithm of the corresponding likelihood function then gives 
the following error function: 


$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk})\} \tag{6.35}$$


where $y_{nk}$ denotes $y_k(\mathbf{x}_n, \mathbf{w})$. Again, the derivative of the error function with
respect to the pre-activation for a particular output unit takes the form (6.31), just as in
the regression case.


6.4.3 Multiclass classification


Finally, we consider the standard multiclass classification problem in which each 
input is assigned to one of K mutually exclusive classes. The binary target variables 
$t_k \in \{0, 1\}$ have a 1-of-$K$ coding scheme indicating the class, and the network
outputs are interpreted as $y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1|\mathbf{x})$, leading to the error function
(5.80), which we reproduce here: 


$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w}). \tag{6.36}$$


The output-unit activation function, which corresponds to the canonical link, is given 
by the softmax function: 


$$y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_{j} \exp(a_j(\mathbf{x}, \mathbf{w}))}, \tag{6.37}$$


which satisfies $0 \leqslant y_k \leqslant 1$ and $\sum_k y_k = 1$. Note that the $y_k(\mathbf{x}, \mathbf{w})$ are unchanged
if a constant is added to all of the $a_k(\mathbf{x}, \mathbf{w})$, causing the error function to be constant
for some directions in weight space. This degeneracy is removed if an appropriate
regularization term is added to the error function. Once again, the derivative of the
error function with respect to the pre-activation for a particular output unit takes the
familiar form (6.31).
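The two properties just stated can be checked numerically. The sketch below is an illustration only: the array shapes, the random pre-activations, and the 1-of-$K$ labels are hypothetical. It computes the softmax (6.37) and the error (6.36) in a numerically stable way, confirms that adding a constant to all pre-activations leaves the outputs unchanged, and compares the derivative $y_k - t_k$ with finite differences.

```python
import numpy as np

def log_softmax(a):
    """Logarithm of the softmax (6.37) over the last axis, shifted for stability."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def softmax(a):
    return np.exp(log_softmax(a))

def multiclass_cross_entropy(a, t):
    """Error (6.36) from pre-activations a and 1-of-K targets t, both of shape (N, K)."""
    return -np.sum(t * log_softmax(a))

rng = np.random.default_rng(1)
a = rng.normal(size=(4, 3))                  # hypothetical pre-activations
t = np.eye(3)[rng.integers(0, 3, size=4)]    # hypothetical 1-of-K labels

y = softmax(a)
y_shifted = softmax(a + 7.5)    # the same constant added to every pre-activation
grad_analytic = y - t           # the form (6.31)

# Central finite differences on each pre-activation.
h = 1e-6
grad_numeric = np.zeros_like(a)
for idx in np.ndindex(*a.shape):
    step = np.zeros_like(a)
    step[idx] = h
    grad_numeric[idx] = (multiclass_cross_entropy(a + step, t)
                         - multiclass_cross_entropy(a - step, t)) / (2 * h)
```

Subtracting the row maximum inside `log_softmax` relies on exactly the shift invariance described above.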
In summary, there is a natural choice of both output-unit activation function
and matching error function according to the type of problem being solved. For
regression, we use linear outputs and a sum-of-squares error; for multiple independent
binary classifications, we use logistic-sigmoid outputs and a cross-entropy error
function; and for multiclass classification, we use softmax outputs with the
corresponding multiclass cross-entropy error function. For classification problems
involving two classes, we can use a single logistic-sigmoid output, or alternatively, we
can use a network with two outputs having a softmax output activation function.
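The equivalence of the two options for two-class problems can be seen from the identity $\exp(a_1)/\{\exp(a_1) + \exp(a_2)\} = \sigma(a_1 - a_2)$. A short numerical illustration, with hypothetical pre-activation values:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical pre-activations of a two-output network for three inputs.
a = np.array([[0.3, -1.1],
              [2.0, 2.5],
              [-0.7, 0.7]])

p_softmax = softmax(a)[:, 0]               # class-1 probability from two softmax outputs
p_sigmoid = sigmoid(a[:, 0] - a[:, 1])     # the same probability from a single sigmoid
```

Only the difference of the two pre-activations matters, which is another instance of the softmax degeneracy noted earlier.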

This procedure is quite general, and by considering other forms of conditional 
distribution, we can derive the associated error functions as the corresponding neg- 
ative log likelihood. We will see an example of this in the next section when we 
consider multimodal network outputs. 


Mixture Density Networks 


So far in this chapter we have discussed neural networks whose outputs represent 
simple probability distributions comprising either a Gaussian for continuous vari- 
ables or a binary distribution for discrete variables. We close the chapter by showing 
how a neural network can represent more general conditional probabilities by treat- 
ing the outputs of the network as the parameters of a more complex distribution, in 
this case a Gaussian mixture model. This is known as a mixture density network, 
and we will see how to define the associated error function and the corresponding 
output-unit activation functions. 


6.5.1 Robot kinematics example 


The goal of supervised learning is to model a conditional distribution p(t|x), 
which for many simple regression problems is chosen to be Gaussian. However, 
practical machine learning problems can often have significantly non-Gaussian dis- 
tributions. These can arise, for example, with inverse problems in which the distri- 
bution can be multimodal, in which case the Gaussian assumption can lead to very 
poor predictions. 

As a simple illustration of an inverse problem, consider the kinematics of a robot 
arm, as illustrated in Figure 6.16. The forward problem involves finding the end ef- 
fector position given the joint angles and has a unique solution. However, in practice 
we wish to move the end effector of the robot to a specific position, and to do this we 
must set appropriate joint angles. We therefore need to solve the inverse problem, 
which has two solutions, as seen in Figure 6.16. 
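The sketch below makes the two-solution structure concrete. It rests on assumptions that are not fixed by the text: $\theta_1$ is measured from the $x_1$ axis, $\theta_2$ is the angle of the second link relative to the first, and the link lengths and target position are hypothetical. The convention used in Figure 6.16 may differ.

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=1.0, L2=0.8):
    """End effector position for joint angles (assumed convention, see lead-in)."""
    x1 = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return np.array([x1, x2])

def inverse_kinematics(x1, x2, L1=1.0, L2=0.8):
    """Return both joint-angle solutions that reach the position (x1, x2)."""
    c2 = (x1 ** 2 + x2 ** 2 - L1 ** 2 - L2 ** 2) / (2.0 * L1 * L2)
    if abs(c2) > 1.0:
        raise ValueError("target position is out of reach")
    solutions = []
    for theta2 in (np.arccos(c2), -np.arccos(c2)):
        theta1 = np.arctan2(x2, x1) - np.arctan2(L2 * np.sin(theta2),
                                                 L1 + L2 * np.cos(theta2))
        solutions.append((theta1, theta2))
    return solutions

target = np.array([1.2, 0.6])                      # hypothetical desired position
solutions = inverse_kinematics(*target)
reached = [forward_kinematics(th1, th2) for th1, th2 in solutions]

# Averaging the two joint-angle solutions does not give a valid solution.
mean_angles = np.mean(solutions, axis=0)
reached_by_mean = forward_kinematics(*mean_angles)
```

Both entries of `reached` coincide with `target`, whereas `reached_by_mean` corresponds to a fully stretched arm pointing towards the target.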

Forward problems often correspond to causality in a physical system and gen- 
erally have a unique solution. For instance, a specific pattern of symptoms in the 
human body may be caused by the presence of a particular disease. In machine 
learning, however, we typically have to solve an inverse problem, such as trying to 
predict the presence of a disease given a set of symptoms. If the forward problem 
involves a many-to-one mapping, then the inverse problem will have multiple solu- 
tions. For instance, several different diseases may result in the same symptoms. 


Figure 6.16 (a) A two-link robot arm, in which the Cartesian coordinates $(x_1, x_2)$ of the
end effector are determined uniquely by the two joint angles $\theta_1$ and $\theta_2$ and the (fixed)
lengths $L_1$ and $L_2$ of the arms. This is known as the forward kinematics of the arm.
(b) In practice, we have to find the joint angles that will give rise to a desired end
effector position. This inverse kinematics has two solutions corresponding to ‘elbow up’
and ‘elbow down’.
In the robotics example, the kinematics is defined by geometrical equations, and
the multimodality is readily apparent. However, in many machine learning problems
the presence of multimodality, particularly in problems involving spaces of high
dimensionality, can be less obvious. For tutorial purposes, however, we will consider
a simple toy problem for which we can easily visualize the multimodality. The data
for this problem is generated by sampling a variable $x$ uniformly over the interval
$(0, 1)$, to give a set of values $\{x_n\}$, and the corresponding target values $t_n$ are
obtained by computing the function $x_n + 0.3 \sin(2\pi x_n)$ and then adding uniform noise
over the interval $(-0.1, 0.1)$. The inverse problem is then obtained by keeping the
same data points but exchanging the roles of $x$ and $t$. Figure 6.17 shows the data sets
for the forward and inverse problems, along with the results of fitting two-layer
neural networks having six hidden units and a single linear output unit by minimizing
a sum-of-squares error function. Least squares corresponds to maximum likelihood
under a Gaussian assumption. We see that this leads to a good model for the forward
problem but a very poor model for the highly non-Gaussian inverse problem.

Figure 6.17 On the left is the data set for a simple forward problem in which the red
curve shows the result of fitting a two-layer neural network by minimizing the
sum-of-squares error function. The corresponding inverse problem, shown on the right,
is obtained by exchanging the roles of $x$ and $t$. Here the same network, again trained
by minimizing the sum-of-squares error function, gives a poor fit to the data due to
the multimodality of the data set.


6.5.2 Conditional mixture distributions


Figure 6.18 The mixture density network can represent general conditional probability
densities $p(\mathbf{t}|\mathbf{x})$ by considering a parametric mixture model for the distribution of
$\mathbf{t}$ whose parameters are determined by the outputs of a neural network that takes $\mathbf{x}$
as its input vector.

We therefore seek a general framework for modelling conditional probability
distributions. This can be achieved by using a mixture model for $p(\mathbf{t}|\mathbf{x})$ in which
both the mixing coefficients as well as the component densities are flexible functions
of the input vector $\mathbf{x}$, giving rise to a mixture density network. For any given value of
$\mathbf{x}$, the mixture model provides a general formalism for modelling an arbitrary
conditional density function $p(\mathbf{t}|\mathbf{x})$. Provided we consider a sufficiently flexible network,
we then have a framework for approximating arbitrary conditional distributions.
Here we will develop the model explicitly for Gaussian components, so that


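$$p(\mathbf{t}|\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \, \mathcal{N}\left(\mathbf{t} \,|\, \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x})\right). \tag{6.38}$$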


This is an example of a heteroscedastic model in which the noise variance on the 
data is a function of the input vector x. Instead of Gaussians, we can use other dis- 
tributions for the components, such as Bernoulli distributions if the target variables 
are binary rather than continuous. We have also specialized to the case of isotropic 
covariances for the components, although the mixture density network can readily 
be extended to allow for general covariance matrices by representing the covariances 
using a Cholesky factorization (Williams, 1996). Even with isotropic components, 
the conditional distribution p(t|x) does not assume factorization with respect to the 
components of t (in contrast to the standard sum-of-squares regression model) as a 
consequence of the mixture distribution. 

We now take the various parameters of the mixture model, namely the mixing
coefficients $\pi_k(\mathbf{x})$, the means $\boldsymbol{\mu}_k(\mathbf{x})$, and the variances $\sigma_k^2(\mathbf{x})$, to be governed
by the outputs of a neural network that takes $\mathbf{x}$ as its input. The structure of this
mixture density network is illustrated in Figure 6.18. The mixture density network 
is closely related to the mixture-of-experts model (Jacobs et al., 1991). The prin- 
cipal difference is that a mixture of experts has independent parameters for each 
component model in the mixture, whereas in a mixture density network, the same 
function is used to predict the parameters of all the component densities as well as 
the mixing coefficients, and so the nonlinear hidden units are shared amongst the 
input-dependent functions. 

The neural network in Figure 6.18 can, for example, be a two-layer network
having sigmoidal (tanh) hidden units. If there are $K$ components in the mixture
model (6.38), and if $\mathbf{t}$ has $L$ components, then the network will have $K$ output-unit
pre-activations denoted by $a_k^{\pi}$ that determine the mixing coefficients $\pi_k(\mathbf{x})$, $K$
outputs denoted by $a_k^{\sigma}$ that determine the Gaussian standard deviations $\sigma_k(\mathbf{x})$, and
$K \times L$ outputs denoted by $a_{kj}^{\mu}$ that determine the components $\mu_{kj}(\mathbf{x})$ of the Gaussian
means $\boldsymbol{\mu}_k(\mathbf{x})$. The total number of network outputs is given by $(L + 2)K$, unlike
the usual $L$ outputs for a network that simply predicts the conditional means of the
target variables.

The mixing coefficients must satisfy the constraints 


$$\sum_{k=1}^{K} \pi_k(\mathbf{x}) = 1, \qquad 0 \leqslant \pi_k(\mathbf{x}) \leqslant 1, \tag{6.39}$$


which can be achieved using a set of softmax outputs: 


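$$\pi_k(\mathbf{x}) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}. \tag{6.40}$$

Similarly, the variances must satisfy $\sigma_k^2(\mathbf{x}) \geqslant 0$ and so can be represented in terms
of the exponentials of the corresponding network pre-activations using

$$\sigma_k(\mathbf{x}) = \exp(a_k^{\sigma}). \tag{6.41}$$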


Finally, because the means $\boldsymbol{\mu}_k(\mathbf{x})$ have real components, they can be represented
directly by the network outputs:


$$\mu_{kj}(\mathbf{x}) = a_{kj}^{\mu} \tag{6.42}$$


in which the output-unit activation functions are given by the identity $f(a) = a$.
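A minimal sketch of this output layer is given below. The function name `mdn_parameters` and the ordering of the three blocks inside the output vector are arbitrary choices made for illustration, and random numbers stand in for the outputs of a real network.

```python
import numpy as np

def mdn_parameters(a, K, L):
    """Split raw outputs a of shape (N, (L + 2) * K) into mixture parameters.

    The block ordering (mixing, widths, means) is an assumption of this sketch.
    """
    a_pi = a[:, :K]
    a_sigma = a[:, K:2 * K]
    a_mu = a[:, 2 * K:].reshape(-1, K, L)

    # Softmax (6.40), shifted by the row maximum for numerical stability.
    e = np.exp(a_pi - a_pi.max(axis=1, keepdims=True))
    pi = e / e.sum(axis=1, keepdims=True)

    sigma = np.exp(a_sigma)    # exponential (6.41) keeps the widths positive
    mu = a_mu                  # identity (6.42)
    return pi, sigma, mu

rng = np.random.default_rng(2)
K, L, N = 3, 2, 5                           # hypothetical sizes
a = rng.normal(size=(N, (L + 2) * K))       # stand-in for network outputs
pi, sigma, mu = mdn_parameters(a, K, L)
```

The returned arrays have shapes `(N, K)`, `(N, K)`, and `(N, K, L)` respectively.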

The learnable parameters of the mixture density network comprise the vector w 
of weights and biases in the neural network, which can be set by maximum likelihood 
or equivalently by minimizing an error function defined to be the negative logarithm 
of the likelihood. For independent data, this error function takes the form 


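$$E(\mathbf{w}) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(\mathbf{x}_n, \mathbf{w}) \, \mathcal{N}\left(\mathbf{t}_n \,|\, \boldsymbol{\mu}_k(\mathbf{x}_n, \mathbf{w}), \sigma_k^2(\mathbf{x}_n, \mathbf{w})\right) \right\} \tag{6.43}$$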


where we have made the dependencies on w explicit. 
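The error function (6.43) can be evaluated as in the following sketch, which assumes isotropic Gaussian components and takes the mixture parameters as arrays rather than as network outputs. The function name, array shapes, and random parameter values are hypothetical. The log-sum-exp rearrangement is a standard numerical precaution and is not part of the derivation in the text.

```python
import numpy as np

def mdn_error(pi, sigma, mu, t):
    """Negative log likelihood (6.43) for isotropic Gaussian components.

    Shapes: pi (N, K), sigma (N, K), mu (N, K, L), t (N, L).
    """
    L = t.shape[1]
    sq_dist = np.sum((t[:, None, :] - mu) ** 2, axis=2)              # (N, K)
    log_gauss = (-0.5 * L * np.log(2.0 * np.pi) - L * np.log(sigma)
                 - 0.5 * sq_dist / sigma ** 2)
    log_terms = np.log(pi) + log_gauss
    # Log-sum-exp over components avoids underflow when densities are tiny.
    m = log_terms.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.sum(np.exp(log_terms - m), axis=1))
    return -np.sum(log_mix)

rng = np.random.default_rng(3)
N, K, L = 6, 3, 2                                  # hypothetical sizes
pi = rng.dirichlet(np.ones(K), size=N)             # valid mixing coefficients
sigma = rng.uniform(0.5, 1.5, size=(N, K))
mu = rng.normal(size=(N, K, L))
t = rng.normal(size=(N, L))
E = mdn_error(pi, sigma, mu, t)
```

With a single component ($K = 1$) the expression reduces to the usual Gaussian negative log likelihood.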


6.5.3 Gradient optimization 


To minimize the error function, we need to calculate the derivatives of the error 
E(w) with respect to the components of w. We will see later how to compute these 
derivatives automatically. It is instructive, however, to derive suitable expressions for 
the derivatives of the error with respect to the output-unit pre-activations explicitly as 
this highlights the probabilistic interpretation of these quantities. Because the error 
function (6.43) is composed of a sum of terms, one for each training data point, we 
can consider the derivatives for a particular input vector $\mathbf{x}_n$ with associated target
vector $\mathbf{t}_n$. The derivatives of the total error $E$ are obtained by summing over all
data points, or the individual gradients for each data point can be used directly in
gradient-based optimization algorithms (see Chapter 7).

It is convenient to introduce the following variables:


$$\gamma_{nk} = \gamma_k(\mathbf{t}_n|\mathbf{x}_n) = \frac{\pi_k \mathcal{N}_{nk}}{\sum_{l=1}^{K} \pi_l \mathcal{N}_{nl}} \tag{6.44}$$


where $\mathcal{N}_{nk}$ denotes $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k(\mathbf{x}_n), \sigma_k^2(\mathbf{x}_n))$. These quantities have a natural
interpretation as posterior probabilities for the components of the mixture in which the
mixing coefficients $\pi_k(\mathbf{x})$ are viewed as $\mathbf{x}$-dependent prior probabilities (Exercise 6.17).

The derivatives of the error function with respect to the network output
pre-activations governing the mixing coefficients are given by (Exercise 6.18)


$$\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_{nk}. \tag{6.45}$$


Similarly, the derivatives with respect to the output pre-activations controlling the
component means are given by (Exercise 6.19)


$$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_{nk} \left\{ \frac{\mu_{kl} - t_{nl}}{\sigma_k^2} \right\}. \tag{6.46}$$


Finally, the derivatives with respect to the output pre-activations controlling the
component variances are given by (Exercise 6.20)


$$\frac{\partial E_n}{\partial a_k^{\sigma}} = \gamma_{nk} \left\{ L - \frac{\|\mathbf{t}_n - \boldsymbol{\mu}_k\|^2}{\sigma_k^2} \right\}. \tag{6.47}$$
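The expressions (6.44) to (6.47) can be checked against finite differences for a single data point, as in the sketch below. The sizes `K` and `L`, the packing of the pre-activations into one flat vector, the random values, and the helper names are all assumptions made for illustration.

```python
import numpy as np

K, L = 3, 2   # hypothetical numbers of components and target dimensions

def unpack(a):
    """Map a flat vector of (L + 2) * K pre-activations to (pi, sigma, mu)."""
    a_pi, a_sigma, a_mu = a[:K], a[K:2 * K], a[2 * K:].reshape(K, L)
    pi = np.exp(a_pi - a_pi.max())
    pi = pi / pi.sum()
    return pi, np.exp(a_sigma), a_mu

def weighted_densities(a, t):
    """Return pi_k * N(t | mu_k, sigma_k^2 I) per component, plus the parameters."""
    pi, sigma, mu = unpack(a)
    sq = np.sum((t - mu) ** 2, axis=1)
    gauss = (2.0 * np.pi * sigma ** 2) ** (-L / 2) * np.exp(-0.5 * sq / sigma ** 2)
    return pi * gauss, pi, sigma, mu, sq

def error_n(a, t):
    """Contribution E_n of one data point to the error (6.43)."""
    return -np.log(np.sum(weighted_densities(a, t)[0]))

def analytic_gradient(a, t):
    wd, pi, sigma, mu, sq = weighted_densities(a, t)
    gamma = wd / wd.sum()                                      # (6.44)
    d_pi = pi - gamma                                          # (6.45)
    d_sigma = gamma * (L - sq / sigma ** 2)                    # (6.47)
    d_mu = gamma[:, None] * (mu - t) / sigma[:, None] ** 2     # (6.46)
    return np.concatenate([d_pi, d_sigma, d_mu.ravel()])

def numerical_gradient(a, t, h=1e-6):
    grad = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = h
        grad[i] = (error_n(a + step, t) - error_n(a - step, t)) / (2 * h)
    return grad

rng = np.random.default_rng(4)
a = 0.5 * rng.normal(size=(L + 2) * K)   # stand-in for network pre-activations
t = 0.5 * rng.normal(size=L)             # stand-in for a target vector
g_analytic = analytic_gradient(a, t)
g_numeric = numerical_gradient(a, t)
```

Agreement between `g_analytic` and `g_numeric` gives some confidence that the three derivative formulas are mutually consistent with the parameterization (6.40) to (6.42).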


6.5.4 Predictive distribution 


We illustrate the use of a mixture density network by returning to the toy
example of an inverse problem shown in Figure 6.17. Plots of the mixing
coefficients $\pi_k(x)$, the means $\mu_k(x)$, and the conditional density contours corresponding
to $p(t|x)$, are shown in Figure 6.19. The outputs of the neural network, and hence the
parameters in the mixture model, are necessarily continuous single-valued functions
of the input variables. However, we see from Figure 6.19(c) that the model is able to
produce a conditional density that is unimodal for some values of $x$ and trimodal for
other values by modulating the amplitudes of the mixing components $\pi_k(x)$.
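For readers who wish to reproduce the toy data referred to here, the sketch below generates it following the recipe of Section 6.5.1. The sample size, the random seed, and the helper `spread` are assumptions of this illustration. The helper measures how widely the output varies inside a narrow window of input values, which gives a crude indication of multimodality.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1000                                         # hypothetical sample size
x = rng.uniform(0.0, 1.0, size=N)
t = x + 0.3 * np.sin(2.0 * np.pi * x) + rng.uniform(-0.1, 0.1, size=N)

def spread(inputs, outputs, centre=0.5, half_width=0.02):
    """Range of `outputs` among points whose `inputs` lie in a narrow window."""
    mask = np.abs(inputs - centre) < half_width
    return outputs[mask].max() - outputs[mask].min()

forward_spread = spread(x, t)   # variation of t for x near 0.5 (forward problem)
inverse_spread = spread(t, x)   # variation of x for t near 0.5 (inverse problem)
```

In the forward direction the spread is set mainly by the noise, whereas in the inverse direction it spans several branches of the curve, which is the regime in which the mixing coefficients of all three components become comparable.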

Once a mixture density network has been trained, it can predict the conditional 
density function of the target data for any given value of the input vector. This 
conditional density represents a complete description of the generator of the data, so 
far as the problem of predicting the value of the output vector is concerned. From this 
density function, we can calculate more specific quantities that may be of interest in 
different applications. One of the simplest of these is the mean, corresponding to the 
conditional average of the target data, and is given by 


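$$\mathbb{E}[\mathbf{t}|\mathbf{x}] = \int \mathbf{t} \, p(\mathbf{t}|\mathbf{x}) \, \mathrm{d}\mathbf{t} = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \boldsymbol{\mu}_k(\mathbf{x}) \tag{6.48}$$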


where we have used (6.38) (see Exercise 6.21). Because a standard network trained by
least squares approximates the conditional mean, we see that a mixture density network
can reproduce the conventional least-squares result as a special case. Of course, as we have
already noted, for a multimodal distribution the conditional mean is of limited value.

Figure 6.19 (a) Plot of the mixing coefficients $\pi_k(x)$ as a function of $x$ for the three
mixture components in a mixture density network trained on the data shown in Figure 6.17.
The model has three Gaussian components and uses a two-layer neural network with five
tanh sigmoidal units in the hidden layer and nine outputs (corresponding to the three
means and three variances of the Gaussian components and the three mixing coefficients).
At both small and large values of $x$, where the conditional probability density of the
target data is unimodal, only one of the Gaussian components has a high value for its
prior probability, whereas at intermediate values of $x$, where the conditional density
is trimodal, the three mixing coefficients have comparable values. (b) Plots of the
means $\mu_k(x)$ using the same colour coding as for the mixing coefficients. (c) Plot of
the contours of the corresponding conditional probability density of the target data for
the same mixture density network. (d) Plot of the approximate conditional mode, shown
by the red points, of the conditional density.

We can similarly evaluate the variance of the density function about the condi- 
tional average, to give 


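$$s^2(\mathbf{x}) = \mathbb{E}\left[ \|\mathbf{t} - \mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \,\big|\, \mathbf{x} \right] \tag{6.49}$$

$$\phantom{s^2(\mathbf{x})} = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \left\{ \sigma_k^2(\mathbf{x}) + \left\| \boldsymbol{\mu}_k(\mathbf{x}) - \sum_{l=1}^{K} \pi_l(\mathbf{x}) \boldsymbol{\mu}_l(\mathbf{x}) \right\|^2 \right\} \tag{6.50}$$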

where we have used (6.38) and (6.48). This is more general than the corresponding 
least-squares result because the variance is a function of x. 
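The conditional mean (6.48) and variance (6.50) can be computed from the mixture parameters at a given input, as sketched below with hypothetical parameter values for a scalar target. One detail of the sketch is worth flagging: for an $L$-dimensional target with isotropic components, the expected squared norm picks up a factor of $L$ on each $\sigma_k^2$, so the code includes that factor. With $L = 1$, as in the toy problem, it coincides with (6.50) as written. The result is compared with direct numerical integration of the density on a grid.

```python
import numpy as np

def mixture_mean_and_variance(pi, sigma, mu):
    """Mean (6.48) and variance about the mean for one input.

    Shapes: pi (K,), sigma (K,), mu (K, L). Returns mean (L,) and a scalar.
    """
    L = mu.shape[1]
    mean = pi @ mu
    # The factor L equals 1 for scalar targets, matching (6.50) as written.
    var = np.sum(pi * (L * sigma ** 2 + np.sum((mu - mean) ** 2, axis=1)))
    return mean, var

# Hypothetical mixture parameters at some fixed input (K = 3, L = 1).
pi = np.array([0.2, 0.5, 0.3])
sigma = np.array([0.3, 0.5, 0.2])
mu = np.array([[-1.0], [0.5], [2.0]])

mean, var = mixture_mean_and_variance(pi, sigma, mu)

# Independent check: integrate the one-dimensional density on a fine grid.
grid = np.linspace(-10.0, 10.0, 200001)
dt = grid[1] - grid[0]
density = sum(p * np.exp(-0.5 * ((grid - m[0]) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)
              for p, s, m in zip(pi, sigma, mu))
mean_numeric = np.sum(grid * density) * dt
var_numeric = np.sum((grid - mean_numeric) ** 2 * density) * dt
```

Because the parameters are functions of the input, repeating this calculation at different inputs gives an input-dependent variance.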

We have seen that for multimodal distributions, the conditional mean can give
a poor representation of the data. For instance, in controlling the simple robot arm
shown in Figure 6.16, we need to pick one of the two possible joint angle settings
to achieve the desired end-effector location, but the average of the two solutions is
not itself a solution. In such cases, the conditional mode may be of more value.
Because the conditional mode for the mixture density network does not have a simple
analytical solution, a numerical iteration is required. A simple alternative is to
take the mean of the most probable component (i.e., the one with the largest mixing
coefficient) at each value of $\mathbf{x}$. This is shown for the toy data set in Figure 6.19(d).


Exercises


6.1 (⋆⋆⋆) Use the result (2.126) to derive an expression for the surface area $S_D$ and
the volume $V_D$ of a hypersphere of unit radius in $D$ dimensions. To do this, consider
the following result, which is obtained by transforming from Cartesian to polar
coordinates:

$$\prod_{i=1}^{D} \int_{-\infty}^{\infty} e^{-x_i^2} \, \mathrm{d}x_i = S_D \int_{0}^{\infty} e^{-r^2} r^{D-1} \, \mathrm{d}r. \tag{6.51}$$

Using the gamma function, defined by

$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t} \, \mathrm{d}t \tag{6.52}$$

together with (2.126), evaluate both sides of this equation, and hence show that

$$S_D = \frac{2\pi^{D/2}}{\Gamma(D/2)}. \tag{6.53}$$

Next, by integrating with respect to the radius from 0 to 1, show that the volume of
the unit hypersphere in $D$ dimensions is given by

$$V_D = \frac{S_D}{D}. \tag{6.54}$$

Finally, use the results $\Gamma(1) = 1$ and $\Gamma(3/2) = \sqrt{\pi}/2$ to show that (6.53) and (6.54)
reduce to the usual expressions for $D = 2$ and $D = 3$.


6.2 (⋆⋆⋆) Consider a hypersphere of radius $a$ in $D$ dimensions together with the
concentric hypercube of side $2a$, so that the hypersphere touches the hypercube at the
centres of each of its sides. By using the results of Exercise 6.1, show that the ratio
of the volume of the hypersphere to the volume of the cube is given by

$$\frac{\text{volume of hypersphere}}{\text{volume of cube}} = \frac{\pi^{D/2}}{D \, 2^{D-1} \, \Gamma(D/2)}. \tag{6.55}$$

Now make use of Stirling’s formula in the form

$$\Gamma(x + 1) \simeq (2\pi)^{1/2} e^{-x} x^{x + 1/2} \tag{6.56}$$

which is valid for $x \gg 1$, to show that, as $D \to \infty$, the ratio (6.55) goes to zero.
Show also that the distance from the centre of the hypercube to one of the corners
divided by the perpendicular distance to one of the sides is $\sqrt{D}$, which therefore goes
to $\infty$ as $D \to \infty$. From these results, we see that, in a space of high dimensionality,
most of the volume of a cube is concentrated in the large number of corners, which
themselves become very long ‘spikes’!


6.3 (⋆⋆⋆) In this exercise, we explore the behaviour of the Gaussian distribution in
high-dimensional spaces. Consider a Gaussian distribution in $D$ dimensions given by

$$p(\mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{\|\mathbf{x}\|^2}{2\sigma^2} \right). \tag{6.57}$$

We wish to find the density as a function of the radius in polar coordinates in which
the direction variables have been integrated out. To do this, show that the integral of
the probability density over a thin shell of radius $r$ and thickness $\epsilon$, where $\epsilon \ll 1$, is
given by $p(r)\epsilon$ where

$$p(r) = \frac{S_D r^{D-1}}{(2\pi\sigma^2)^{D/2}} \exp\left( -\frac{r^2}{2\sigma^2} \right) \tag{6.58}$$

where $S_D$ is the surface area of a unit hypersphere in $D$ dimensions. Show that the
function $p(r)$ has a single stationary point located, for large $D$, at $\hat{r} \simeq \sqrt{D}\sigma$. By
considering $p(\hat{r} + \epsilon)$ where $\epsilon \ll \hat{r}$, show that for large $D$,

$$p(\hat{r} + \epsilon) = p(\hat{r}) \exp\left( -\frac{3\epsilon^2}{2\sigma^2} \right), \tag{6.59}$$

which shows that $\hat{r}$ is a maximum of the radial probability density and also that $p(r)$
decays exponentially away from its maximum at $\hat{r}$ with length scale $\sigma$. We have
already seen that $\sigma \ll \hat{r}$ for large $D$, and so we see that most of the probability
mass is concentrated in a thin shell at large radius. Finally, show that the probability
density $p(\mathbf{x})$ is larger at the origin than at the radius $\hat{r}$ by a factor of $\exp(D/2)$.
We therefore see that most of the probability mass in a high-dimensional Gaussian
distribution is located at a different radius from the region of high probability density.


6.4 (⋆⋆) Consider a two-layer network function of the form (6.11) in which the
hidden-unit nonlinear activation functions $h(\cdot)$ are given by logistic sigmoid functions of the
form

$$\sigma(a) = \{1 + \exp(-a)\}^{-1}. \tag{6.60}$$

Show that there exists an equivalent network, which computes exactly the same function,
but with hidden-unit activation functions given by $\tanh(a)$ where the tanh function
is defined by (6.14). Hint: first find the relation between $\sigma(a)$ and $\tanh(a)$, and
then show that the parameters of the two networks differ by linear transformations.


6.5 (⋆⋆) The swish activation function (Ramachandran, Zoph, and Le, 2017) is defined
by

$$h(x) = x \sigma(\beta x) \tag{6.61}$$
where $\sigma(x)$ is the logistic-sigmoid activation function defined by (6.13). When used
in a neural network, $\beta$ can be treated as a learnable parameter. Either sketch or plot
using software graphs of the swish activation function as well as its first derivative
for $\beta = 0.1$, $\beta = 1.0$, and $\beta = 10$. Show that when $\beta \to \infty$, the swish function
becomes the ReLU function.


6.6 (⋆) We saw in (5.72) that the derivative of the logistic-sigmoid activation function
can be expressed in terms of the function value itself. Derive the corresponding
result for the tanh activation function defined by (6.14).


6.7 (⋆⋆) Show that the softplus activation function $\zeta(a)$ given by (6.16) satisfies the
properties:

$$\zeta(a) - \zeta(-a) = a \tag{6.62}$$

$$\ln \sigma(a) = -\zeta(-a) \tag{6.63}$$

$$\frac{\mathrm{d}\zeta(a)}{\mathrm{d}a} = \sigma(a) \tag{6.64}$$

$$\zeta^{-1}(a) = \ln\left( \exp(a) - 1 \right) \tag{6.65}$$

where $\sigma(a)$ is the logistic-sigmoid activation function given by (6.13).


6.8 (⋆) Show that minimization of the error function (6.25) with respect to the variance
$\sigma^2$ gives the result (6.27).


6.9 (⋆) Show that maximizing the likelihood function under the conditional
distribution (6.28) for a multioutput neural network is equivalent to minimizing the
sum-of-squares error function (6.29). Also, show that the noise variance that minimizes this
error function is given by (6.30).


6.10 (⋆⋆) Consider a regression problem involving multiple target variables in which it is
assumed that the distribution of the targets, conditioned on the input vector $\mathbf{x}$, is a
Gaussian of the form

$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}) = \mathcal{N}(\mathbf{t}|\mathbf{y}(\mathbf{x}, \mathbf{w}), \boldsymbol{\Sigma}) \tag{6.66}$$

where $\mathbf{y}(\mathbf{x}, \mathbf{w})$ is the output of a neural network with input vector $\mathbf{x}$ and weight
vector $\mathbf{w}$, and $\boldsymbol{\Sigma}$ is the covariance of the assumed Gaussian noise on the targets.
Given a set of independent observations of $\mathbf{x}$ and $\mathbf{t}$, write down the error function
that must be minimized to find the maximum likelihood solution for $\mathbf{w}$, if we assume
that $\boldsymbol{\Sigma}$ is fixed and known. Now assume that $\boldsymbol{\Sigma}$ is also to be determined from the data,
and write down an expression for the maximum likelihood solution for $\boldsymbol{\Sigma}$. Note that
the optimizations of $\mathbf{w}$ and $\boldsymbol{\Sigma}$ are now coupled, in contrast to the case of independent
target variables discussed in Section 6.4.1.


6.11 (⋆⋆) Consider a binary classification problem in which the target values are
$t \in \{0, 1\}$, with a network output $y(\mathbf{x}, \mathbf{w})$ that represents $p(t = 1|\mathbf{x})$, and suppose that
there is a probability $\epsilon$ that the class label on a training data point has been incorrectly
set. Assuming i.i.d. data, write down the error function corresponding to the negative
log likelihood. Verify that the error function (6.33) is obtained when $\epsilon = 0$. Note that
this error function makes the model robust to incorrectly labelled data, in contrast to
the usual cross-entropy error function.


6.12 (⋆⋆) The error function (6.33) for binary classification problems was derived for a
network having a logistic-sigmoid output activation function, so that
$0 \leqslant y(\mathbf{x}, \mathbf{w}) \leqslant 1$, and data having target values $t \in \{0, 1\}$. Derive the corresponding
error function if we consider a network having an output $-1 \leqslant y(\mathbf{x}, \mathbf{w}) \leqslant 1$ and target values
$t = 1$ for class $\mathcal{C}_1$ and $t = -1$ for class $\mathcal{C}_2$. What would be the appropriate choice of
output-unit activation function?


6.13 (⋆) Show that maximizing the likelihood for a multiclass neural network model
in which the network outputs have the interpretation $y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1|\mathbf{x})$ is
equivalent to minimizing the cross-entropy error function (6.36).


6.14 (⋆) Show that the derivative of the error function (6.33) with respect to the
pre-activation $a_k$ for an output unit having a logistic-sigmoid activation function
$y_k = \sigma(a_k)$, where $\sigma(a)$ is given by (6.13), satisfies (6.31).


6.15 (⋆) Show that the derivative of the error function (6.36) with respect to the
pre-activation $a_k$ for output units having a softmax activation function (6.37) satisfies
(6.31).


6.16 (⋆⋆) Write down a pair of equations that express the Cartesian coordinates $(x_1, x_2)$
for the robot arm shown in Figure 6.16 in terms of the joint angles $\theta_1$ and $\theta_2$ and
the lengths $L_1$ and $L_2$ of the links. Assume the origin of the coordinate system is
given by the attachment point of the lower arm. These equations define the forward
kinematics of the robot arm.


6.17 (⋆⋆) Show that the variable $\gamma_{nk}$ defined by (6.44) can be viewed as the posterior
probabilities $p(k|\mathbf{t})$ for the components of the mixture distribution (6.38) in which
the mixing coefficients $\pi_k(\mathbf{x})$ are viewed as $\mathbf{x}$-dependent prior probabilities $p(k)$.


6.18 (⋆⋆) Derive the result (6.45) for the derivative of the error function with respect to
the network output pre-activations controlling the mixing coefficients in the mixture
density network.


6.19 (⋆⋆) Derive the result (6.46) for the derivative of the error function with respect to
the network output pre-activations controlling the component means in the mixture
density network.


6.20 (⋆⋆) Derive the result (6.47) for the derivative of the error function with respect to the
network output pre-activations controlling the component variances in the mixture
density network.


6.21 (⋆⋆⋆) Verify the results (6.48) and (6.50) for the conditional mean and variance of
the mixture density network model.


Gradient Descent


In the previous chapter we saw that neural networks are a very broad and flexible
class of functions and are able in principle to approximate any desired function to
arbitrarily high accuracy given a sufficiently large number of hidden units. Moreover,
we saw that deep neural networks can encode inductive biases corresponding
to hierarchical representations, which prove valuable in a wide range of practical
applications. We now turn to the task of finding a suitable setting for the network
parameters (weights and biases), based on a set of training data.

As with the regression and classification models discussed in earlier chapters,
we choose the model parameters by optimizing an error function. We have seen how
to define a suitable error function for a particular application by using maximum
likelihood (Section 6.4). Although in principle the error function could be minimized
numerically through a series of direct error function evaluations, this turns out to be
very inefficient. Instead, we turn to another core concept that is used in deep learning, which
is that optimizing the error function can be done much more efficiently by making
use of gradient information, in other words by evaluating the derivatives of the error
function with respect to the network parameters. This is why we took care to ensure
that the function represented by the neural network is differentiable by design.
Likewise, the error function itself also needs to be differentiable.

The required derivatives of the error function with respect to each of the pa- 
rameters in the network can be evaluated efficiently using a technique called back- 
propagation, which involves successive computations that flow backwards through 
the network in a way that is analogous to the forward flow of function computations 
during the evaluation of the network outputs. 

Although the likelihood is used to define an error function, the goal when op- 
timizing the error function in a neural network is to achieve good generalization on 
test data. In classical statistics, maximum likelihood is used to fit a parametric model 
to a finite data set, in which the number of data points typically far exceeds the num- 
ber of parameters in the model. The optimal solution has the maximum value of the 
likelihood function, and the values found for the fitted parameters are of direct inter- 
est. By contrast, modern deep learning works with very rich models containing huge 
numbers of learnable parameters, and the goal is never simply exact optimization. 
Instead, the properties and behaviour of the learning algorithm itself, along with var- 
ious methods for regularization, are important in determining how well the solution 
generalizes to new data. 


Error Surfaces 


Our goal during training is to find values for the weights and biases in the neural 
network that will allow it to make effective predictions. For convenience we will 
group these parameters into a single vector w, and we will optimize w by using a 
chosen error function E(w). At this point, it is useful to have a geometrical picture 
of the error function, which we can view as a surface sitting over ‘weight space’, as 
shown in Figure 7.1. 

First note that if we make a small step in weight space from $\mathbf{w}$ to $\mathbf{w} + \delta\mathbf{w}$ then
the change in the error function is given by


$$\delta E \simeq \delta\mathbf{w}^{\mathrm{T}} \nabla E(\mathbf{w}) \tag{7.1}$$


where the vector $\nabla E(\mathbf{w})$ points in the direction of the greatest rate of increase of
the error function. Provided the error $E(\mathbf{w})$ is a smooth, continuous function of $\mathbf{w}$,
its smallest value will occur at a point in weight space such that the gradient of the
error function vanishes, so that


$$\nabla E(\mathbf{w}) = 0 \tag{7.2}$$


as otherwise we could make a small step in the direction of $-\nabla E(\mathbf{w})$ and thereby
further reduce the error. Points at which the gradient vanishes are called stationary
points and may be further classified into minima, maxima, and saddle points.
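The first-order relation (7.1) and the effect of a small step along $-\nabla E(\mathbf{w})$ can be illustrated numerically. The error function below is a hypothetical two-parameter example chosen only because its gradient is easy to write down; it is not a neural network error function, and the step size `eta` is likewise an arbitrary choice.

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # hypothetical symmetric matrix

def error(w):
    """A smooth toy error function of two parameters."""
    return 0.5 * w @ A @ w + np.sum(np.cos(w))

def gradient(w):
    return A @ w - np.sin(w)

w = np.array([0.8, -1.5])
g = gradient(w)

# First-order prediction (7.1) for a small step dw.
dw = 1e-4 * np.array([0.6, -0.8])
actual_change = error(w + dw) - error(w)
predicted_change = dw @ g

# A small step in the direction of the negative gradient reduces the error.
eta = 1e-2
error_before = error(w)
error_after = error(w - eta * g)
```

The discrepancy between `actual_change` and `predicted_change` is second order in the step length, which is why the approximation is only reliable for small steps.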


Figure 7.1 Geometrical view of the error function $E(\mathbf{w})$ as a surface sitting over
weight space. Point $\mathbf{w}_A$ is a local minimum and $\mathbf{w}_B$ is the global minimum, so
that $E(\mathbf{w}_A) > E(\mathbf{w}_B)$. At any point $\mathbf{w}_C$, the local gradient of the error surface
is given by the vector $\nabla E$.


We will aim to find a vector w such that E(w) takes its smallest value. How- 
ever, the error function typically has a highly nonlinear dependence on the weights 
and bias parameters, and so there will be many points in weight space at which the 
gradient vanishes (or is numerically very small). Indeed, for any point w that is a 
local minimum, there will generally be other points in weight space that are equivalent
minima. For instance, in a two-layer network of the kind shown in Figure 6.9,
with M hidden units, each point in weight space is a member of a family of $M!\,2^M$
equivalent points.

Furthermore, there may be multiple non-equivalent stationary points and in par- 
ticular multiple non-equivalent minima. A minimum that corresponds to the smallest 
value of the error function across the whole of w-space is said to be a global min- 
imum. Any other minima corresponding to higher values of the error function are 
said to be local minima. The error surfaces for deep neural networks can be very 
complex, and it was thought that gradient-based methods might become trapped in 
poor local minima. In practice, this seems not to be the case, and large networks can 
reach solutions with similar performance under a variety of initial conditions. 


7.1.1 Local quadratic approximation 


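Insight into the optimization problem and into the various techniques for solving
it can be obtained by considering a local quadratic approximation to the error function.
The Taylor expansion of E(w) around some point $\hat{\mathbf{w}}$ in weight space is given by

$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}}) \tag{7.3}$$

where cubic and higher terms have been omitted. Here b is defined to be the gradient
of E evaluated at $\hat{\mathbf{w}}$:

$$\mathbf{b} \equiv \nabla E \big|_{\mathbf{w} = \hat{\mathbf{w}}}. \tag{7.4}$$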


The Hessian is defined to be the corresponding matrix of second derivatives

$$\mathbf{H}(\hat{\mathbf{w}}) = \nabla\nabla E(\mathbf{w}) \big|_{\mathbf{w} = \hat{\mathbf{w}}}. \tag{7.5}$$

If there is a total of W weights and biases in the network, then w and b have length
W and H has dimensionality W × W. From (7.3), the corresponding local approximation
to the gradient is given by

$$\nabla E(\mathbf{w}) = \mathbf{b} + \mathbf{H}(\mathbf{w} - \hat{\mathbf{w}}). \tag{7.6}$$

For points w that are sufficiently close to $\hat{\mathbf{w}}$, these expressions will give reasonable
approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point
w* that is a minimum of the error function. In this case there is no linear term,
because $\nabla E = 0$ at w*, and (7.3) becomes

$$E(\mathbf{w}) = E(\mathbf{w}^\star) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^\star)^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^\star) \tag{7.7}$$

where the Hessian H is evaluated at w*. To interpret this geometrically, consider the
eigenvalue equation for the Hessian matrix:

$$\mathbf{H}\mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{7.8}$$

where the eigenvectors $\mathbf{u}_i$ form a complete orthonormal set so that

$$\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}. \tag{7.9}$$

We now expand (w − w*) as a linear combination of the eigenvectors in the form

$$\mathbf{w} - \mathbf{w}^\star = \sum_i \alpha_i \mathbf{u}_i. \tag{7.10}$$


This can be regarded as a transformation of the coordinate system in which the origin
is translated to the point w* and the axes are rotated to align with the eigenvectors
through the orthogonal matrix whose columns are $\{\mathbf{u}_1, \ldots, \mathbf{u}_W\}$. By substituting
(7.10) into (7.7) and using (7.8) and (7.9), the error function can be written in the
form

$$E(\mathbf{w}) = E(\mathbf{w}^\star) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2. \tag{7.11}$$

Suppose we set all $\alpha_i = 0$ for $i \neq j$ and then vary $\alpha_j$, corresponding to moving
w away from w* in the direction of $\mathbf{u}_j$. We see from (7.11) that the error function
will increase if the corresponding eigenvalue $\lambda_j$ is positive and will decrease if it is
negative. If all eigenvalues are positive then w* corresponds to a local minimum of
the error function, whereas if they are all negative then w* corresponds to a local
maximum. If we have a mix of positive and negative eigenvalues then w* represents
a saddle point.

A matrix H is said to be positive definite if, and only if,

$$\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v} \neq \mathbf{0}. \tag{7.12}$$

Figure 7.2 In the neighbourhood of a minimum w*, the error function can be approximated by a
quadratic. Contours of constant error are then ellipses whose axes are aligned with the
eigenvectors $\mathbf{u}_i$ of the Hessian matrix, with lengths that are inversely proportional to the
square roots of the corresponding eigenvalues $\lambda_i$.


Because the eigenvectors $\{\mathbf{u}_i\}$ form a complete set, an arbitrary vector v can be
written in the form

$$\mathbf{v} = \sum_i c_i \mathbf{u}_i. \tag{7.13}$$

From (7.8) and (7.9), we then have

$$\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i \tag{7.14}$$

and so H will be positive definite if, and only if, all its eigenvalues are positive.
Thus, a necessary and sufficient condition for w* to be a local minimum is that the
gradient of the error function should vanish at w* and the Hessian matrix evaluated
at w* should be positive definite. In the new coordinate system, whose basis vectors
are given by the eigenvectors $\{\mathbf{u}_i\}$, the contours of constant E(w) are axis-aligned
ellipses centred on the origin, as illustrated in Figure 7.2.
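
The eigenvalue test described above can be sketched for a two-dimensional weight space, where the eigenvalues of a symmetric 2 × 2 Hessian have a closed form. The function names and the example matrices below are illustrative assumptions, not part of the text.

```python
import math


def hessian_eigenvalues_2x2(h11, h12, h22):
    """Eigenvalues (smallest first) of the symmetric matrix [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius, mean + radius


def classify_stationary_point(h11, h12, h22):
    """Classify a stationary point from the signs of the Hessian eigenvalues."""
    lam_min, lam_max = hessian_eigenvalues_2x2(h11, h12, h22)
    if lam_min > 0:
        return "local minimum"          # positive definite Hessian
    if lam_max < 0:
        return "local maximum"          # all eigenvalues negative
    if lam_min < 0 < lam_max:
        return "saddle point"           # eigenvalues of mixed sign
    return "undetermined (zero eigenvalue)"


print(classify_stationary_point(2.0, 1.0, 6.0))    # both eigenvalues positive
print(classify_stationary_point(-2.0, 0.0, -1.0))  # both eigenvalues negative
print(classify_stationary_point(1.0, 2.0, 1.0))    # eigenvalues -1 and 3
```

When an eigenvalue is exactly zero the quadratic approximation alone does not settle the question, which is why the sketch reports that case separately.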


Gradient Descent Optimization 


There is little hope of finding an analytical solution to the equation $\nabla E(\mathbf{w}) = 0$ for
an error function as complex as one defined by a neural network, and so we resort to
iterative numerical procedures. The optimization of continuous nonlinear functions
is a widely studied problem, and there exists an extensive literature on how to solve it
efficiently. Most techniques involve choosing some initial value $\mathbf{w}^{(0)}$ for the weight
vector and then moving through weight space in a succession of steps of the form

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} + \Delta\mathbf{w}^{(\tau-1)} \tag{7.15}$$

where $\tau$ labels the iteration step. Different algorithms involve different choices for
the weight vector update $\Delta\mathbf{w}^{(\tau)}$.

Because of the complex shape of the error surface for all but the simplest neural
networks, the solution found will depend, among other things, on the particular
choice of initial parameter values $\mathbf{w}^{(0)}$. To find a sufficiently good solution, it may
be necessary to run a gradient-based algorithm multiple times, each time using a different
randomly chosen starting point, and comparing the resulting performance on
an independent validation set.


7.2.1 Use of gradient information 


The gradient of an error function for a deep neural network can be evaluated 
efficiently using the technique of error backpropagation, and applying this gradient 
information can lead to significant improvements in the speed of network training. 
We can see why this is so, as follows. 

In the quadratic approximation to the error function given by (7.3), the error
surface is specified by the quantities b and H, which contain a total of W(W + 3)/2
independent elements (because the matrix H is symmetric), where W is the
dimensionality of w (i.e., the total number of learnable parameters in the network).
The location of the minimum of this quadratic approximation therefore depends on
$O(W^2)$ parameters, and we should not expect to be able to locate the minimum until
we have gathered $O(W^2)$ independent pieces of information. If we do not make
use of gradient information, we would expect to have to perform $O(W^2)$ function
evaluations, each of which would require $O(W)$ steps. Thus, the computational
effort needed to find the minimum using such an approach would be $O(W^3)$.

Now compare this with an algorithm that makes use of the gradient information.
Because $\nabla E$ is a vector of length W, each evaluation of $\nabla E$ brings W pieces of
information, and so we might hope to find the minimum of the function in $O(W)$
gradient evaluations. As we shall see, by using error backpropagation, each such
evaluation takes only $O(W)$ steps and so the minimum can now be found in $O(W^2)$
steps. Although the quadratic approximation only holds in the neighbourhood of
a minimum, the efficiency gains are generic. For this reason, the use of gradient
information forms the basis of all practical algorithms for training neural networks.


7.2.2 Batch gradient descent 


The simplest approach to using gradient information is to choose the weight update
in (7.15) such that there is a small step in the direction of the negative gradient,
so that

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \eta \nabla E(\mathbf{w}^{(\tau-1)}) \tag{7.16}$$

where the parameter $\eta > 0$ is known as the learning rate. After each such update, the
gradient is re-evaluated for the new weight vector $\mathbf{w}^{(\tau)}$ and the process repeated.
At each step, the weight vector is moved in the direction of the greatest rate of
decrease of the error function, and so this approach is known as gradient descent or
steepest descent. Note that the error function is defined with respect to a training set,
and so to evaluate $\nabla E$, each step requires that the entire training set be processed.
Techniques that use the whole data set at once are called batch methods.
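
A minimal sketch of batch gradient descent (7.16) is given below for a two-parameter linear model with a sum-of-squares error. The toy data, the model, the learning rate, and the fixed number of steps are all assumptions chosen for illustration; the point is only that every update sweeps the whole training set.

```python
# Toy data generated from y = 1 + 2x (assumed for illustration, noise-free).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x for x in xs]


def batch_gradient(w):
    """Gradient of E(w) = 0.5 * sum_n (w0 + w1 * x_n - y_n)^2 over the full data set."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = w[0] + w[1] * x - y      # residual for this data point
        g0 += r
        g1 += r * x
    return [g0, g1]


def batch_gradient_descent(w, eta, steps):
    """Repeat w <- w - eta * grad E(w); each step processes every data point."""
    for _ in range(steps):
        g = batch_gradient(w)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w


w_batch = batch_gradient_descent([0.0, 0.0], eta=0.05, steps=2000)
print(w_batch)   # approaches the generating parameters [1.0, 2.0]
```

For this error function the Hessian is constant, so the learning rate only needs to stay below 2 divided by its largest eigenvalue (about 16.8 here) for the iteration to converge.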


7.2.3 Stochastic gradient descent 


Deep learning methods benefit greatly from very large data sets. However, batch
methods can become extremely inefficient if there are many data points in the training
set because each error function or gradient evaluation requires the entire data set
to be processed. To find a more efficient approach, note that error functions based on
maximum likelihood for a set of independent observations comprise a sum of terms,
one for each data point:

$$E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}). \tag{7.17}$$

The most widely used training algorithms for large data sets are based on a sequential
version of gradient descent known as stochastic gradient descent (Bottou, 2010), or
SGD, which updates the weight vector based on one data point at a time, so that

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \eta \nabla E_n(\mathbf{w}^{(\tau-1)}). \tag{7.18}$$

This update is repeated by cycling through the data. A complete pass through the
whole training set is known as a training epoch. This technique is also known as
online gradient descent, especially if the data arises from a continuous stream of
new data points. Stochastic gradient descent is summarized in Algorithm 7.1.


Algorithm 7.1: Stochastic gradient descent

Input: Training set of data points indexed by n ∈ {1, ..., N}
       Error function per data point E_n(w)
       Learning rate parameter η
       Initial weight vector w
Output: Final weight vector w

n ← 1
repeat
    w ← w − η∇E_n(w)       // update weight vector
    n ← n + 1 (mod N)      // iterate over data
until convergence
return w

A further advantage of stochastic gradient descent, compared to batch gradient 
descent, is that it handles redundancy in the data much more efficiently. To see this, 
consider an extreme example in which we take a data set and double its size by 
duplicating every data point. Note that this simply multiplies the error function by 
a factor of 2 and so is equivalent to using the original error function, if the value of 
the learning rate is adjusted to compensate. Batch methods will require double the 
computational effort to evaluate the batch error function gradient, whereas stochastic 
gradient descent will be unaffected. Another property of stochastic gradient descent 
is the possibility of escaping from local minima, since a stationary point with respect 
to the error function for the whole data set will generally not be a stationary point 
for each data point individually. 
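
The steps of Algorithm 7.1 can be sketched as follows. The linear model, the noise-free toy data, and the fixed update budget that stands in for 'until convergence' are assumptions made for illustration only.

```python
# Toy data from y = 1 + 2x (assumed for illustration; noise-free so the fit is exact).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x for x in xs]


def sgd(xs, ys, w, eta, num_updates):
    """Stochastic gradient descent with E_n(w) = 0.5 * (w0 + w1 * x_n - y_n)^2."""
    N = len(xs)
    n = 0                                    # index of the current data point
    for _ in range(num_updates):             # fixed budget instead of a convergence test
        r = w[0] + w[1] * xs[n] - ys[n]      # residual for data point n
        # w <- w - eta * grad E_n(w), using only data point n.
        w = [w[0] - eta * r, w[1] - eta * r * xs[n]]
        n = (n + 1) % N                      # cycle through the data
    return w


w_sgd = sgd(xs, ys, [0.0, 0.0], eta=0.05, num_updates=8000)
print(w_sgd)   # close to the generating parameters [1.0, 2.0]
```

Each update here touches a single data point, so its cost does not grow with the size of the training set. With noisy data and a fixed learning rate the weights would instead fluctuate around the minimum rather than settle on it.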


7.2.4 Mini-batches


A downside of stochastic gradient descent is that the gradient of the error func- 
tion computed from a single data point provides a very noisy estimate of the gradient 
of the error function computed on the full data set. We can consider an interme- 
diate approach in which a small subset of data points, called a mini-batch, is used 
to evaluate the gradient at each iteration. In determining the optimum size for the 
mini-batch, note that the error in computing the mean from N samples is given by
$\sigma/\sqrt{N}$ where $\sigma$ is the standard deviation of the distribution generating the data. This
indicates that there are diminishing returns in estimating the true gradient from increasing
the batch size. If we increase the size of the mini-batch by a factor of 100
then the error only reduces by a factor of 10. Another consideration in choosing the
mini-batch size is the desire to make efficient use of the hardware architecture on
which the code is running. For example, on some hardware platforms, mini-batch
sizes that are powers of 2 (for example, 64, 128, 256, ...) work well.

One important consideration when using mini-batches is that the constituent data 
points should be chosen randomly from the data set, since in raw data sets there 
may be correlations between successive data points arising from the way the data 
was collected (for example, if the data points have been ordered alphabetically or 
by date). This is often handled by randomly shuffling the entire data set and then 
subsequently drawing mini-batches as successive blocks of data. The data set can 
also be reshuffled between iterations through the data set, so that each mini-batch is 
unlikely to have been used before, which can help escape local minima. The variant 
of stochastic gradient descent with mini-batches is summarized in Algorithm 7.2. 
Note that the learning algorithm is often still called ‘stochastic gradient descent’ 
even when mini-batches are used. 
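
A minimal sketch of mini-batch stochastic gradient descent with reshuffling between passes is shown below. The linear model, the toy data, the seed, and the fixed update budget are illustrative assumptions. The initial shuffle follows the advice above about removing any ordering present in the raw data.

```python
import random


def minibatch_sgd(data, w, eta, batch_size, num_updates, seed=0):
    """Mini-batch SGD for the model y = w0 + w1 * x with a sum-of-squares error."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)                        # remove any ordering in the raw data
    n = 0
    for _ in range(num_updates):             # fixed budget instead of a convergence test
        batch = data[n:n + batch_size]       # successive block of the shuffled data
        g0 = g1 = 0.0
        for x, y in batch:
            r = w[0] + w[1] * x - y
            g0 += r
            g1 += r * x
        w = [w[0] - eta * g0, w[1] - eta * g1]
        n += batch_size
        if n >= len(data):                   # end of the pass through the data
            rng.shuffle(data)                # reshuffle so batches differ next time
            n = 0
    return w


# Toy data from y = 1 + 2x on [0, 1] (assumed for illustration, noise-free).
points = [(i / 7.0, 1.0 + 2.0 * i / 7.0) for i in range(8)]
w_mb = minibatch_sgd(points, [0.0, 0.0], eta=0.1, batch_size=4, num_updates=4000)
print(w_mb)   # close to [1.0, 2.0]
```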


7.2.5 Parameter initialization 


Iterative algorithms such as gradient descent require that we choose some ini- 
tial setting for the parameters being learned. The specific initialization can have a 
significant effect on how long it takes to reach a solution and on the generalization 
performance of the resulting trained network. Unfortunately, there is relatively little 
theory to guide the initialization strategy. 

One key consideration, however, is symmetry breaking. Consider a set of hidden 
units or output units that take the same inputs. If the parameters were all initialized 
with the same value, for example if they were all set to zero, the parameters of these 
units would all be updated in unison and the units would each compute the same 
function and hence be redundant. This problem can be addressed by initializing 
parameters randomly from some distribution to break symmetry. If computational 
resources permit, the network might be trained multiple times starting from different 
random initializations and the results compared on held-out data. 

Algorithm 7.2: Mini-batch stochastic gradient descent

Input: Training set of data points indexed by n ∈ {1, ..., N}
       Batch size B
       Error function per mini-batch E_{n:n+B−1}(w)
       Learning rate parameter η
       Initial weight vector w
Output: Final weight vector w

n ← 1
repeat
    w ← w − η∇E_{n:n+B−1}(w)    // weight vector update
    n ← n + B
    if n > N then
        shuffle data
        n ← 1
    end if
until convergence
return w


The distribution used to initialize the weights is typically either a uniform distribution
in the range $[-\epsilon, \epsilon]$ or a zero-mean Gaussian of the form $\mathcal{N}(0, \epsilon^2)$. The choice
of the value of $\epsilon$ is important, and various heuristics to select it have been proposed.
One widely used approach is called He initialization (He et al., 2015b). Consider a
network in which layer l evaluates the following transformations

$$a_i^{(l)} = \sum_{j=1}^{M} w_{ij} z_j^{(l-1)} \tag{7.19}$$

$$z_i^{(l)} = \mathrm{ReLU}(a_i^{(l)}) \tag{7.20}$$


where M is the number of units that send connections to unit i, and the ReLU activation
function is given by (6.17). Suppose we initialize the weights using a Gaussian
$\mathcal{N}(0, \epsilon^2)$, and suppose that the outputs $z_j^{(l-1)}$ of the units in layer l − 1 have variance
$\lambda^2$. Then we can easily show that

$$\mathbb{E}[a_i^{(l)}] = 0 \tag{7.21}$$

$$\operatorname{var}[z_j^{(l)}] = \frac{M}{2} \epsilon^2 \lambda^2 \tag{7.22}$$

where the factor of 1/2 arises from the ReLU activation function. Ideally we want
to ensure that the variance of the pre-activations neither decays to zero nor grows
significantly as we propagate from one layer to the next. If we therefore require that
the units at layer l also have variance $\lambda^2$ then we arrive at the following choice for
the standard deviation of the Gaussian used to initialize the weights that feed into a
unit with M inputs:

$$\epsilon = \sqrt{\frac{2}{M}}. \tag{7.23}$$


Figure 7.3 Schematic illustration of fixed-step gradient descent for an error function that has
substantially different curvatures along different directions. The error surface E has the form of
a long valley, as depicted by the ellipses. Note that, for most points in weight space, the local
negative gradient vector $-\nabla E$ does not point towards the minimum of the error function.
Successive steps of gradient descent can therefore oscillate across the valley, leading to very slow
progress along the valley towards the minimum. The vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ are the eigenvectors of
the Hessian matrix.

Margin cross-references: Section 6.3.4, Section 7.1.1

It is also possible to treat the scale $\epsilon$ of the initialization distribution as a hyperparameter
and to explore different values across multiple training runs. The bias parameters
are typically set to small positive values to ensure that most pre-activations
are initially active during learning. This is particularly helpful with ReLU units,
where we want the pre-activations to be positive so that there is a non-zero gradient
to drive learning.
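
The effect of the choice (7.23) can be explored with a small simulation. The sketch below pushes one random input through a stack of bias-free ReLU layers and records the mean squared pre-activation of each layer, which estimates the variance because the pre-activations have zero mean. The layer width, depth, seed, and the comparison scale $\sqrt{1/M}$ are assumptions made for illustration.

```python
import math
import random


def preactivation_variances(width, depth, weight_std, seed=0):
    """Return the mean squared pre-activation of each layer for one random input."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(width)]       # random input vector
    variances = []
    for _ in range(depth):
        # Each unit receives `width` inputs through weights drawn from N(0, weight_std^2).
        a = [sum(rng.gauss(0.0, weight_std) * zj for zj in z) for _ in range(width)]
        variances.append(sum(ai * ai for ai in a) / width)
        z = [max(0.0, ai) for ai in a]                    # ReLU activation
    return variances


M = 128
he = preactivation_variances(M, depth=5, weight_std=math.sqrt(2.0 / M))
smaller = preactivation_variances(M, depth=5, weight_std=math.sqrt(1.0 / M))

print(he)        # stays at roughly the same scale from layer to layer
print(smaller)   # shrinks by about a factor of 2 per layer
```

Because both runs share a seed, the second network is the first with every weight scaled by $1/\sqrt{2}$, so the comparison isolates the effect of the initialization scale.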

Another important class of techniques for initializing the parameters of a neural 
network is by using the values that result from training the network on a different 
task or by exploiting various forms of unsupervised training. These techniques fall 
into the broad class of transfer learning techniques. 


Convergence 


When applying gradient descent in practice, we need to choose a value for the learning
rate parameter $\eta$. Consider the simple error surface depicted in Figure 7.3 for a
hypothetical two-dimensional weight space in which the curvature of E varies significantly
with direction, creating a ‘valley’. At most points on the error surface, the
local gradient vector for batch gradient descent, which is perpendicular to the local
contour, does not point directly towards the minimum. Intuitively we might expect
that increasing the value of $\eta$ should lead to bigger steps through weight space and
hence faster convergence. However, the successive steps oscillate back and forth
across the valley, and if we increase $\eta$ too much, those oscillations will become divergent.
Because $\eta$ must be kept sufficiently small to avoid divergent oscillations
across the valley, progress along the valley is very slow. Gradient descent then takes
many small steps to reach the minimum and is a very inefficient procedure.

We can gain deeper insight into the nature of this problem by considering the
quadratic approximation to the error function in the neighbourhood of the minimum.
From (7.7), (7.8), and (7.10), the gradient of the error function in this approximation
can be written as

$$\nabla E = \sum_i \alpha_i \lambda_i \mathbf{u}_i. \tag{7.24}$$

Again using (7.10) we can express the change in the weight vector in terms of corresponding
changes in the coefficients $\{\alpha_i\}$:

$$\Delta\mathbf{w} = \sum_i \Delta\alpha_i \mathbf{u}_i. \tag{7.25}$$


Combining (7.24) with (7.25) and the gradient descent formula (7.16) and using the
orthonormality relation (7.9) for the eigenvectors of the Hessian, we obtain the following
expression for the change in $\alpha_i$ at each step of the gradient descent algorithm:

$$\Delta\alpha_i = -\eta \lambda_i \alpha_i \tag{7.26}$$

from which it follows that

$$\alpha_i^{\text{new}} = (1 - \eta\lambda_i)\, \alpha_i^{\text{old}} \tag{7.27}$$

where ‘old’ and ‘new’ denote values before and after a weight update. Using the
orthonormality relation (7.9) for the eigenvectors together with (7.10), we have

$$\mathbf{u}_i^{\mathrm{T}} (\mathbf{w} - \mathbf{w}^\star) = \alpha_i \tag{7.28}$$

and so $\alpha_i$ can be interpreted as the distance to the minimum along the direction $\mathbf{u}_i$.
From (7.27) we see that these distances evolve independently such that, at each step,
the distance along the direction of $\mathbf{u}_i$ is multiplied by a factor $(1 - \eta\lambda_i)$. After a total
of T steps we have

$$\alpha_i^{(T)} = (1 - \eta\lambda_i)^T \alpha_i^{(0)}. \tag{7.29}$$

It follows that, provided $|1 - \eta\lambda_i| < 1$, the limit $T \to \infty$ leads to $\alpha_i = 0$, which
from (7.28) shows that w = w* and so the weight vector has reached the minimum
of the error.

Note that (7.29) demonstrates that gradient descent leads to linear convergence
in the neighbourhood of a minimum. Also, convergence to the stationary point requires
that all the $\lambda_i$ be positive, which in turn implies that the stationary point is
indeed a minimum. By making $\eta$ larger we can make the factor $(1 - \eta\lambda_i)$ smaller
and hence improve the speed of convergence. There is a limit to how large $\eta$ can
be made, however. We can permit $(1 - \eta\lambda_i)$ to go negative (which gives oscillating
values of $\alpha_i$), but we must ensure that $|1 - \eta\lambda_i| < 1$ otherwise the $\alpha_i$ values will
diverge. This limits the value of $\eta$ to $\eta < 2/\lambda_{\max}$ where $\lambda_{\max}$ is the largest of the
eigenvalues. The rate of convergence, however, is dominated by the smallest eigenvalue,
so with $\eta$ set to its largest permitted value, the convergence along the direction
corresponding to the smallest eigenvalue (the long axis of the ellipse in Figure 7.3)
will be governed by

$$\left( 1 - \frac{2\lambda_{\min}}{\lambda_{\max}} \right) \tag{7.30}$$

where $\lambda_{\min}$ is the smallest eigenvalue. If the ratio $\lambda_{\min}/\lambda_{\max}$ (whose reciprocal
is known as the condition number of the Hessian) is very small, corresponding to
highly elongated elliptical error contours as in Figure 7.3, then progress towards the
minimum will be extremely slow.


Figure 7.4 With a fixed learning rate parameter, gradient descent down a surface with low curvature
leads to successively smaller steps corresponding to linear convergence. In such a situation,
the effect of a momentum term is like an increase in the effective learning rate parameter.
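
The behaviour predicted by (7.29) and the limit $\eta < 2/\lambda_{\max}$ can be checked numerically. The sketch below runs gradient descent directly in the eigenvector coordinates of a quadratic error; the eigenvalues, starting point, and learning rates are assumed values chosen for illustration.

```python
def run_gradient_descent(eigenvalues, alpha0, eta, steps):
    """Gradient descent on E = 0.5 * sum_i lambda_i * alpha_i^2 in the eigenbasis."""
    alpha = list(alpha0)
    for _ in range(steps):
        # The gradient component along u_i is lambda_i * alpha_i, as in (7.24).
        alpha = [a - eta * lam * a for a, lam in zip(alpha, eigenvalues)]
    return alpha


lams = [1.0, 10.0]        # lambda_min and lambda_max (assumed values)
alpha0 = [1.0, 1.0]       # initial distances to the minimum along u_1 and u_2
T = 20

eta = 0.15                # below the limit 2 / lambda_max = 0.2
alpha_T = run_gradient_descent(lams, alpha0, eta, T)
closed_form = [(1.0 - eta * lam) ** T * a for lam, a in zip(lams, alpha0)]   # (7.29)

# Above the limit, the component along the high-curvature direction grows.
alpha_div = run_gradient_descent(lams, alpha0, eta=0.25, steps=T)

print(alpha_T, closed_form)
print(alpha_div)
```

With the stable learning rate the high-curvature component oscillates in sign while shrinking, and the low-curvature component decays only slowly. This is the valley behaviour of Figure 7.3 expressed in coordinates.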


7.3.1 Momentum 


One simple technique for dealing with the problem of widely differing eigenvalues
is to add a momentum term to the gradient descent formula. This effectively adds
inertia to the motion through weight space and smooths out the oscillations depicted
in Figure 7.3. The modified gradient descent formula is given by

$$\Delta\mathbf{w}^{(\tau-1)} = -\eta \nabla E(\mathbf{w}^{(\tau-1)}) + \mu \Delta\mathbf{w}^{(\tau-2)} \tag{7.31}$$

where $\mu$ is called the momentum parameter. The weight vector is then updated using
(7.15).

To understand the effect of the momentum term, consider first the motion through
a region of weight space for which the error surface has relatively low curvature, as
indicated in Figure 7.4. If we make the approximation that the gradient is unchanging,
then we can apply (7.31) iteratively to a long series of weight updates, and then
sum the resulting geometric series to give

$$\Delta\mathbf{w} = -\eta \nabla E \{1 + \mu + \mu^2 + \ldots\} \tag{7.32}$$

$$= -\frac{\eta}{1 - \mu} \nabla E \tag{7.33}$$

and we see that the result of the momentum term is to increase the effective learning
rate from $\eta$ to $\eta/(1 - \mu)$.

By contrast, in a region of high curvature in which gradient descent is oscillatory,
as indicated in Figure 7.5, successive contributions from the momentum term will
tend to cancel and the effective learning rate will be close to $\eta$. Thus, the momentum
term can lead to faster convergence towards the minimum without causing divergent
oscillations. A schematic illustration of the effect of a momentum term is shown in
Figure 7.6.


Figure 7.5 For a situation in which successive steps of gradient descent are oscillatory, a
momentum term has little influence on the effective value of the learning rate parameter.

Although the inclusion of momentum can lead to an improvement in the performance
of gradient descent, it also introduces a second parameter $\mu$ whose value
needs to be chosen, in addition to that of the learning rate parameter $\eta$. From (7.33)
we see that $\mu$ should be in the range $0 \le \mu < 1$. A typical value used in practice
is $\mu = 0.9$. Stochastic gradient descent with momentum is summarized in
Algorithm 7.3.
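
The effective learning rate $\eta/(1 - \mu)$ in (7.33) can be checked by applying the momentum update repeatedly with a gradient that is held constant, which mimics the low-curvature case. The scalar weight and the numerical values below are assumptions chosen for illustration.

```python
def momentum_step(w, delta_w, grad, eta, mu):
    """One application of (7.31) followed by the weight update (7.15), scalar case."""
    delta_w = -eta * grad + mu * delta_w
    return w + delta_w, delta_w


eta, mu = 0.1, 0.9
grad = 2.0                    # gradient held constant (low-curvature approximation)
w, delta_w = 0.0, 0.0         # start with no accumulated momentum

for _ in range(200):
    w, delta_w = momentum_step(w, delta_w, grad, eta, mu)

# Limiting step predicted by (7.33): effective learning rate eta / (1 - mu).
limit = -eta / (1.0 - mu) * grad

print(delta_w, limit)
```

With $\mu = 0.9$ the step settles at ten times the size of a plain gradient descent step, which matches the factor $1/(1 - \mu)$.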

Figure 7.6 Illustration of the effect of adding a momentum term to the gradient descent algorithm,
showing the more rapid progress along the valley of the error function, compared with the
unmodified gradient descent shown in Figure 7.3.


Algorithm 7.3: Stochastic gradient descent with momentum

Input: Training set of data points indexed by n ∈ {1, ..., N}
       Batch size B
       Error function per mini-batch E_{n:n+B−1}(w)
       Learning rate parameter η
       Momentum parameter μ
       Initial weight vector w
Output: Final weight vector w

n ← 1
Δw ← 0
repeat
    Δw ← −η∇E_{n:n+B−1}(w) + μΔw    // calculate update term
    w ← w + Δw                       // weight vector update
    n ← n + B
    if n > N then
        shuffle data
        n ← 1
    end if
until convergence
return w


The convergence can be further accelerated using a modified version of momentum
called Nesterov momentum (Nesterov, 2004; Sutskever et al., 2013). In conventional
stochastic gradient descent with momentum, we first compute the gradient
at the current location then take a step that is amplified by adding momentum from
the previous step. With the Nesterov method, we change the order of these and first
compute a step based on the previous momentum, then calculate the gradient at this
new location to find the update, so that

$$\Delta\mathbf{w}^{(\tau-1)} = -\eta \nabla E\left(\mathbf{w}^{(\tau-1)} + \mu \Delta\mathbf{w}^{(\tau-2)}\right) + \mu \Delta\mathbf{w}^{(\tau-2)}. \tag{7.34}$$

For batch gradient descent, Nesterov momentum can improve the rate of convergence,
although for stochastic gradient descent it can be less effective.
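
The difference between (7.31) and (7.34) is only the point at which the gradient is evaluated, as the following sketch makes explicit. It uses batch gradients of a hypothetical quadratic error with its minimum at the origin; the eigenvalues, learning rate, momentum parameter, and step count are assumed values.

```python
def grad(w, lams):
    """Gradient of E = 0.5 * sum_i lams[i] * w[i]^2, whose minimum is at the origin."""
    return [lam * wi for lam, wi in zip(lams, w)]


def descend(w, lams, eta, mu, steps, nesterov):
    """Gradient descent with conventional or Nesterov momentum."""
    dw = [0.0] * len(w)
    for _ in range(steps):
        if nesterov:
            # (7.34): evaluate the gradient at the look-ahead point w + mu * dw.
            point = [wi + mu * di for wi, di in zip(w, dw)]
        else:
            # (7.31): evaluate the gradient at the current point.
            point = w
        g = grad(point, lams)
        dw = [-eta * gi + mu * di for gi, di in zip(g, dw)]
        w = [wi + di for wi, di in zip(w, dw)]
    return w


lams = [1.0, 100.0]           # widely differing curvatures (assumed values)
w0 = [1.0, 1.0]

w_classic = descend(w0, lams, eta=0.01, mu=0.9, steps=500, nesterov=False)
w_nesterov = descend(w0, lams, eta=0.01, mu=0.9, steps=500, nesterov=True)

print(w_classic)
print(w_nesterov)
```

Both variants reach the minimum on this problem. The sketch is meant to show the structure of the two updates rather than to benchmark them.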


7.3.2 Learning rate schedule 


In the stochastic gradient descent learning algorithm (7.18), we need to specify
a value for the learning rate parameter $\eta$. If $\eta$ is very small then learning will proceed
slowly. However, if $\eta$ is increased too much it can lead to instability. Although some
oscillation can be tolerated, it should not be divergent. In practice, the best results
are obtained by using a larger value for $\eta$ at the start of training and then reducing the
learning rate over time, so that the value of $\eta$ becomes a function of the step index $\tau$:

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \eta^{(\tau-1)} \nabla E_n(\mathbf{w}^{(\tau-1)}). \tag{7.35}$$

Examples of learning rate schedules include linear, power law, and exponential decay:

$$\eta^{(\tau)} = (1 - \tau/K)\, \eta^{(0)} + (\tau/K)\, \eta^{(K)} \tag{7.36}$$

$$\eta^{(\tau)} = \eta^{(0)} (1 + \tau/S)^{c} \tag{7.37}$$

$$\eta^{(\tau)} = \eta^{(0)} c^{\tau/S} \tag{7.38}$$

where in (7.36) the value of $\eta$ reduces linearly over K steps, after which its value is
held constant at $\eta^{(K)}$. Good values for the hyperparameters $\eta^{(0)}$, $\eta^{(K)}$, K, S, and c
must be found empirically. It can be very helpful in practice to monitor the learning
curve showing how the error function evolves during the gradient descent iteration
to ensure that it is decreasing at a suitable rate.
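
The three schedules (7.36) to (7.38) can be written as small functions of the step index. The function names, argument names, and example hyperparameter values below are illustrative assumptions. Note that a negative exponent c is needed for the power law to decay, and 0 < c < 1 for the exponential form.

```python
def linear_schedule(tau, eta0, etaK, K):
    """(7.36): interpolate linearly from eta0 to etaK over K steps, then hold etaK."""
    if tau >= K:
        return etaK
    return (1.0 - tau / K) * eta0 + (tau / K) * etaK


def power_law_schedule(tau, eta0, S, c):
    """(7.37): eta0 * (1 + tau / S)^c, which decays when c is negative."""
    return eta0 * (1.0 + tau / S) ** c


def exponential_schedule(tau, eta0, S, c):
    """(7.38): eta0 * c^(tau / S), which decays when 0 < c < 1."""
    return eta0 * c ** (tau / S)


for tau in (0, 50, 100, 200):
    print(
        tau,
        linear_schedule(tau, eta0=0.1, etaK=0.01, K=100),
        power_law_schedule(tau, eta0=0.1, S=100, c=-1.0),
        exponential_schedule(tau, eta0=0.1, S=100, c=0.5),
    )
```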


7.3.3 RMSProp and Adam 


We saw that the optimal learning rate depends on the local curvature of the er- 
ror surface, and moreover that this curvature can vary according to the direction in 
parameter space. This motivates several algorithms that use different learning rates 
for each parameter in the network. The values of these learning rates are adjusted 
automatically during training. Here we review some of the most widely used exam- 
ples. Note, however, that this intuition really applies only if the principal curvature 
directions are aligned with the axes in weight space, corresponding to a locally diag- 
onal Hessian matrix, which is unlikely to be the case in practice. Nevertheless, these 
types of algorithms can be effective and are widely used. 

The key idea behind AdaGrad, short for ‘adaptive gradient’, is to reduce each 
learning rate parameter over time by using the accumulated sum of squares of all the 
derivatives calculated for that parameter (Duchi, Hazan, and Singer, 2011). Thus, 
the learning rates of parameters associated with high curvature are reduced most rapidly.
Specifically,

$$r_i^{(\tau)} = r_i^{(\tau-1)} + \left( \frac{\partial E(\mathbf{w})}{\partial w_i} \right)^2 \tag{7.39}$$

$$w_i^{(\tau)} = w_i^{(\tau-1)} - \frac{\eta}{\sqrt{r_i^{(\tau)}} + \delta} \left( \frac{\partial E(\mathbf{w})}{\partial w_i} \right) \tag{7.40}$$

where $\eta$ is the learning rate parameter, and $\delta$ is a small constant, say $10^{-8}$, that
ensures numerical stability in the event that $r_i$ is close to zero. The algorithm is
initialized with $r_i^{(0)} = 0$. Here E(w) is the error function for a particular mini-batch,
and the update (7.40) is standard stochastic gradient descent but with a modified
learning rate that is specific to each parameter.
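
A single AdaGrad update following (7.39) and (7.40) can be sketched as below. The function name, the default hyperparameters, and the quadratic example with very different curvatures along the two axes are assumptions for illustration.

```python
import math


def adagrad_step(w, r, grad, eta=0.1, delta=1e-8):
    """One AdaGrad update for a list of parameters, following (7.39) and (7.40)."""
    # Accumulate the squared derivative for each parameter.
    r = [ri + gi * gi for ri, gi in zip(r, grad)]
    # Scale each parameter's step by its own accumulated gradient magnitude.
    w = [wi - eta / (math.sqrt(ri) + delta) * gi for wi, ri, gi in zip(w, r, grad)]
    return w, r


# Hypothetical error E = 0.5 * (w1^2 + 100 * w2^2) evaluated at w = (1, 1).
w = [1.0, 1.0]
r = [0.0, 0.0]                       # initialized to zero
grad = [1.0 * w[0], 100.0 * w[1]]    # derivatives differ by a factor of 100

w, r = adagrad_step(w, r, grad)
print(w)   # both parameters move by about eta despite the different gradients
```

On the first step the accumulated value equals the squared gradient, so each parameter moves by roughly $\eta$ regardless of the scale of its derivative. On later steps the growing sum makes the steps progressively smaller.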

One problem with AdaGrad is that it accumulates the squared gradients from the 
very start of training, and so the associated weight updates can become very small, 
which can slow down training too much in the later phases. The idea behind the 
RMSProp algorithm, which is short for ‘root mean square propagation’, is to replace
the sum of squared gradients of AdaGrad with an exponentially weighted average
(Hinton, 2012), giving

$$r_i^{(\tau)} = \beta r_i^{(\tau-1)} + (1 - \beta) \left( \frac{\partial E(\mathbf{w})}{\partial w_i} \right)^2 \tag{7.41}$$

$$w_i^{(\tau)} = w_i^{(\tau-1)} - \frac{\eta}{\sqrt{r_i^{(\tau)}} + \delta} \left( \frac{\partial E(\mathbf{w})}{\partial w_i} \right) \tag{7.42}$$

where $0 < \beta < 1$ and a typical value is $\beta = 0.9$.
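
The following sketch implements one RMSProp update from (7.41) and (7.42) and applies it with a gradient that is held constant, to show that the step size settles near $\eta$ instead of shrinking indefinitely as it would with AdaGrad. The function name, default values, and the constant-gradient setting are assumptions for illustration.

```python
import math


def rmsprop_step(w, r, grad, eta=0.01, beta=0.9, delta=1e-8):
    """One RMSProp update for a list of parameters, following (7.41) and (7.42)."""
    # Exponentially weighted average of the squared derivatives.
    r = [beta * ri + (1.0 - beta) * gi * gi for ri, gi in zip(r, grad)]
    w = [wi - eta / (math.sqrt(ri) + delta) * gi for wi, ri, gi in zip(w, r, grad)]
    return w, r


w, r = [0.0], [0.0]
grad = [3.0]                  # gradient held constant for the illustration
step_sizes = []
for _ in range(200):
    w_new, r = rmsprop_step(w, r, grad)
    step_sizes.append(w[0] - w_new[0])
    w = w_new

print(r[0], step_sizes[0], step_sizes[-1])
```

The average r approaches the squared gradient, so the later steps have size close to $\eta$. The first step is larger because r starts from zero, which is the bias that Adam's correction factors are designed to remove.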

If we combine RMSProp with momentum, we obtain the Adam optimization 
method (Kingma and Ba, 2014) where the name is derived from ‘adaptive moments’. 
Adam stores the momentum for each parameter separately using update equations 
that consist of exponentially weighted moving averages for both the gradients and 
the squared gradients in the form 


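$$s_i^{(\tau)} = \beta_1 s_i^{(\tau-1)} + (1 - \beta_1) \left( \frac{\partial E(\mathbf{w})}{\partial w_i} \right) \tag{7.43}$$

$$r_i^{(\tau)} = \beta_2 r_i^{(\tau-1)} + (1 - \beta_2) \left( \frac{\partial E(\mathbf{w})}{\partial w_i} \right)^2 \tag{7.44}$$

$$\hat{s}_i^{(\tau)} = \frac{s_i^{(\tau)}}{1 - \beta_1^{\tau}} \tag{7.45}$$

$$\hat{r}_i^{(\tau)} = \frac{r_i^{(\tau)}}{1 - \beta_2^{\tau}} \tag{7.46}$$

$$w_i^{(\tau)} = w_i^{(\tau-1)} - \eta \frac{\hat{s}_i^{(\tau)}}{\sqrt{\hat{r}_i^{(\tau)}} + \delta}. \tag{7.47}$$

Here the factors $1/(1 - \beta_1^{\tau})$ and $1/(1 - \beta_2^{\tau})$ correct for a bias introduced by initializing
$s_i^{(0)}$ and $r_i^{(0)}$ to zero. Note that the bias goes to zero as $\tau$ becomes large, since $\beta_i < 1$,
and so in practice this bias correction is sometimes omitted. Typical values for the
weighting parameters are $\beta_1 = 0.9$ and $\beta_2 = 0.99$. Adam is the most widely adopted
learning algorithm in deep learning and is summarized in Algorithm 7.4.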
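
One Adam update following (7.43) to (7.47) can be sketched as below. The function name, the default learning rate, and the example gradient are assumptions for illustration. The gradient passed in is $\partial E/\partial w_i$ itself, matching the sign convention of the equations.

```python
import math


def adam_step(w, s, r, grad, tau, eta=0.001, beta1=0.9, beta2=0.99, delta=1e-8):
    """One Adam update for a list of parameters; tau is the step index starting at 1."""
    # Exponentially weighted averages of the gradients and squared gradients.
    s = [beta1 * si + (1.0 - beta1) * gi for si, gi in zip(s, grad)]
    r = [beta2 * ri + (1.0 - beta2) * gi * gi for ri, gi in zip(r, grad)]
    # Bias correction for the zero initialization of s and r.
    s_hat = [si / (1.0 - beta1 ** tau) for si in s]
    r_hat = [ri / (1.0 - beta2 ** tau) for ri in r]
    w = [wi - eta * sh / (math.sqrt(rh) + delta) for wi, sh, rh in zip(w, s_hat, r_hat)]
    return w, s, r


# Hypothetical error E = 0.5 * (w1^2 + 100 * w2^2) evaluated at w = (1, 1).
w = [1.0, 1.0]
s = [0.0, 0.0]
r = [0.0, 0.0]
grad = [1.0 * w[0], 100.0 * w[1]]

w, s, r = adam_step(w, s, r, grad, tau=1)
print(w)   # each parameter moves by about eta on the first step
```

On the first step the bias-corrected averages equal the gradient and its square, so every parameter moves by approximately $\eta$ in the direction of the negative gradient, whatever the scale of its derivative.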


Normalization 


Normalization of the variables computed during the forward pass through a neural 
network removes the need for the network to deal with extremely large or extremely 
small values. Although in principle the weights and biases in a neural network can 
adapt to whatever values the input and hidden variables take, in practice normaliza- 
tion can be crucial for ensuring effective training. Here we consider three kinds of 
normalization according to whether we are normalizing across the input data, across 
mini-batches, or across layers. 


Algorithm 7.4: Adam optimization

Input: Training set of data points indexed by n ∈ {1, ..., N}
       Batch size B
       Error function per mini-batch E_{n:n+B−1}(w)
       Learning rate parameter η
       Decay parameters β1 and β2
       Stabilization parameter δ
Output: Final weight vector w

n ← 1
s ← 0
r ← 0
repeat
    Choose a mini-batch at random from D
    g = −∇E_{n:n+B−1}(w)           // evaluate gradient vector
    s ← β1 s + (1 − β1) g
    r ← β2 r + (1 − β2) g ⊙ g      // element-wise multiply
    ŝ ← s/(1 − β1^τ)               // bias correction
    r̂ ← r/(1 − β2^τ)               // bias correction
    Δw ← η ŝ/(√r̂ + δ)             // element-wise operations
    w ← w + Δw                     // weight vector update
    n ← n + B
    if n + B > N then
        shuffle data
        n ← 1
    end if
until convergence
return w


7.4.1 Data normalization


Sometimes we encounter data sets in which different input variables span very
different ranges. For example, in health data, a patient’s height might be measured
in meters, such as 1.8 m, whereas their blood platelet count might be measured in
platelets per microliter, such as 300,000 platelets per $\mu$L. Such variations can make
gradient descent training much more challenging. Consider a single-layer regression
network with two weights in which the two corresponding input variables have very
different ranges. Changes in the value of one of the weights produce much larger
changes in the output, and hence in the error function, than would similar changes in
the other weight. This corresponds to an error surface with very different curvatures
along different axes as illustrated in Figure 7.3.

For continuous input variables, it can therefore be very beneficial to re-scale the
input values so that they span similar ranges. This is easily done by first evaluating
the mean and variance of each input:

$$\mu_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni} \tag{7.48}$$

$$\sigma_i^2 = \frac{1}{N} \sum_{n=1}^{N} (x_{ni} - \mu_i)^2 \tag{7.49}$$

which is a calculation that is performed once, before any training is started. The
input values are then re-scaled using

$$\tilde{x}_{ni} = \frac{x_{ni} - \mu_i}{\sigma_i} \tag{7.50}$$

so that the re-scaled values $\{\tilde{x}_{ni}\}$ have zero mean and unit variance. Note that the
same values of $\mu_i$ and $\sigma_i$ must be used to pre-process any development, validation, or
test data to ensure that all inputs are scaled in the same way. Input data normalization
is illustrated in Figure 7.7.


Figure 7.7 Illustration of the effect of input data normalization. The red circles show
the original data points for a data set with two variables. The blue crosses show the
data set after normalization such that each variable now has zero mean and unit
variance across the data set.
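
The re-scaling in (7.48) to (7.50) can be sketched as follows. The helper names and the small data set (height in meters, platelet count per microliter) are hypothetical, and the sketch assumes that no input variable is constant, since a zero standard deviation would lead to division by zero.

```python
import math

def fit_standardizer(X):
    """Return per-input means and standard deviations, as in (7.48) and (7.49)."""
    N, D = len(X), len(X[0])
    mu = [sum(row[i] for row in X) / N for i in range(D)]
    sigma = [math.sqrt(sum((row[i] - mu[i]) ** 2 for row in X) / N) for i in range(D)]
    return mu, sigma

def standardize(X, mu, sigma):
    """Apply (7.50) using statistics that were computed beforehand."""
    return [[(x - m) / s for x, m, s in zip(row, mu, sigma)] for row in X]

# Hypothetical data: [height in meters, platelets per microliter]
train = [[1.8, 300_000.0], [1.6, 250_000.0], [1.7, 350_000.0]]
test = [[1.75, 280_000.0]]

mu, sigma = fit_standardizer(train)          # computed once, from training data only
train_std = standardize(train, mu, sigma)
test_std = standardize(test, mu, sigma)      # re-uses the training statistics
```

Note that `test_std` is computed with the training-set `mu` and `sigma`, so its values will not in general have exactly zero mean and unit variance.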


7.4.2 Batch normalization


We have seen the importance of normalizing the input data, and we can apply 
similar reasoning to the variables in each hidden layer of a deep network. If there is 
wide variation in the range of activation values in a particular hidden layer, then nor- 
malizing those values to have zero mean and unit variance should make the learning 
problem easier for the next layer. However, unlike normalization of the input values, 
which can be done once prior to the start of training, normalization of the hidden- 
unit values will need to be repeated during training every time the weight values are 
updated. This is called batch normalization (Ioffe and Szegedy, 2015). 

A further motivation for batch normalization arises from the phenomena of vanishing
gradients and exploding gradients, which occur when we try to train very deep
neural networks. From the chain rule of calculus, the gradient of an error function $E$
with respect to a parameter in the first layer of the network is given by

$$\frac{\partial E}{\partial w_i} = \sum_m \cdots \sum_l \sum_j \frac{\partial z_m^{(1)}}{\partial w_i} \cdots \frac{\partial z_j^{(K)}}{\partial z_l^{(K-1)}} \frac{\partial E}{\partial z_j^{(K)}} \tag{7.51}$$

where $z_j^{(k)}$ denotes the activation of node $j$ in layer $k$, and each of the partial derivatives
on the right-hand side of (7.51) represents the elements of the Jacobian matrix
for that layer. The product of a large number of such terms will tend towards 0 if
most of them have a magnitude $< 1$ and will tend towards $\infty$ if most of them have
a magnitude $> 1$. Consequently, as the depth of a network increases, error function
gradients can tend to become either very large or very small. Batch normalization
largely resolves this issue.
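
The effect of multiplying many such terms is easy to see numerically. The sketch below replaces each Jacobian term in (7.51) by a single scalar factor, which is a deliberate simplification made only for illustration; the function name and the factors 0.9 and 1.1 are arbitrary choices.

```python
def product_of_factors(factor, depth):
    """Multiply `depth` identical scalar factors, a stand-in for the terms in (7.51)."""
    result = 1.0
    for _ in range(depth):
        result *= factor
    return result

depths = (10, 50, 100)
shrinking = [product_of_factors(0.9, k) for k in depths]   # magnitudes below 1
growing = [product_of_factors(1.1, k) for k in depths]     # magnitudes above 1
```

Even though both factors are close to 1, at a depth of 100 the first product has fallen below $10^{-4}$ while the second has grown beyond $10^{4}$.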

To see how batch normalization is defined, consider a specific layer within a
multi-layer network. Each hidden unit in that layer computes a nonlinear function of
its input pre-activation $z_i = h(a_i)$, and so we have a choice of whether to normalize
the pre-activation values $a_i$ or the activation values $z_i$. In practice, either approach
may be used, and here we illustrate the procedure by normalizing the pre-activations.
Because weight values are updated after each mini-batch of examples, we apply the
normalization to each mini-batch. Specifically, for a mini-batch of size $K$, we define

$$\mu_i = \frac{1}{K} \sum_{n=1}^{K} a_{ni} \tag{7.52}$$

$$\sigma_i^2 = \frac{1}{K} \sum_{n=1}^{K} (a_{ni} - \mu_i)^2 \tag{7.53}$$

$$\hat{a}_{ni} = \frac{a_{ni} - \mu_i}{\sqrt{\sigma_i^2 + \delta}} \tag{7.54}$$

where the summations over $n = 1, \ldots, K$ are taken over the elements of the mini-batch.
Here $\delta$ is a small constant, introduced to avoid numerical issues in situations
where $\sigma_i^2$ is small.

By normalizing the pre-activations in a given layer of the network, we reduce
the number of degrees of freedom in the parameters of that layer and hence we
reduce its representational capability. We can compensate for this by re-scaling the
pre-activations of the batch to have mean $\beta_i$ and standard deviation $\gamma_i$ using

$$\tilde{a}_{ni} = \gamma_i \hat{a}_{ni} + \beta_i \tag{7.55}$$

where $\beta_i$ and $\gamma_i$ are adaptive parameters that are learned by gradient descent jointly
with the weights and biases of the network. These learnable parameters represent a
key difference compared to input data normalization.

Figure 7.8 Illustration of batch normalization and layer normalization in a neural network. In batch
normalization, shown in (a), the mean and variance are computed across the mini-batch separately for
each hidden unit. In layer normalization, shown in (b), the mean and variance are computed across the
hidden units separately for each data point.

It might appear that the transformation (7.55) has simply undone the effect of the
batch normalization since the mean and variance can now adapt to arbitrary values
again. However, the crucial difference is in the way the parameters evolve during
training. For the original network, the mean and variance across a mini-batch are
determined by a complex function of all the weights and biases in the layer, whereas
in the representation given by (7.55), they are determined directly by independent
parameters $\beta_i$ and $\gamma_i$, which turn out to be much easier to learn during gradient
descent.

Equations (7.52) to (7.55) describe a transformation of the variables that is
differentiable with respect to the learnable parameters $\beta_i$ and $\gamma_i$. This can be viewed
as an additional layer in the neural network, and so each standard hidden layer can
be followed by a batch normalization layer. The structure of the batch-normalization
process is illustrated in Figure 7.8.
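
As a rough illustration of the transformation defined by (7.52) to (7.55), the following sketch normalizes a mini-batch of pre-activations stored as a list of rows, one row per data point. The function name, the argument layout, and the default value of `delta` are assumptions made for this example; only the forward computation is shown, not the gradients with respect to $\beta_i$ and $\gamma_i$.

```python
import math

def batch_norm(A, gamma, beta, delta=1e-5):
    """Normalize pre-activations A (K data points x M units) and then re-scale.

    Statistics are computed over the mini-batch separately for each hidden unit.
    Returns the re-scaled values together with the batch means and variances.
    """
    K, M = len(A), len(A[0])
    mu = [sum(A[n][i] for n in range(K)) / K for i in range(M)]                    # (7.52)
    var = [sum((A[n][i] - mu[i]) ** 2 for n in range(K)) / K for i in range(M)]    # (7.53)
    out = [[gamma[i] * (A[n][i] - mu[i]) / math.sqrt(var[i] + delta) + beta[i]     # (7.54), (7.55)
            for i in range(M)]
           for n in range(K)]
    return out, mu, var
```

With `gamma` set to ones and `beta` set to zeros, each column of the output has approximately zero mean and unit variance; other values shift and scale each unit independently.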

Once the network is trained and we want to make predictions on new data, we
no longer have the training mini-batches available, and we cannot determine a mean
and variance from just one data example. To solve this, we could in principle evaluate
$\mu_i$ and $\sigma_i^2$ for each layer across the whole training set after we have made the
final update to the weights and biases. However, this would involve processing the
whole data set just to evaluate these quantities and is therefore usually too expensive.
Instead, we compute moving averages throughout the training phase:

$$\bar{\mu}_i^{(\tau)} = \alpha \bar{\mu}_i^{(\tau-1)} + (1 - \alpha)\mu_i \tag{7.56}$$

$$\bar{\sigma}_i^{(\tau)} = \alpha \bar{\sigma}_i^{(\tau-1)} + (1 - \alpha)\sigma_i \tag{7.57}$$

where $0 < \alpha < 1$. These moving averages play no role during training but are used
to process new data points during the inference phase.
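
One possible way to maintain these moving averages is sketched below. The class name, the starting values of the averages, and the way the stored standard deviation is combined with `delta` at inference time are all assumptions made for the example, since the text does not specify them.

```python
import math

class RunningBatchStats:
    """Moving averages of per-unit batch statistics, following (7.56) and (7.57)."""

    def __init__(self, num_units, alpha=0.9):
        self.alpha = alpha
        # Starting values are an assumption; the text does not specify them.
        self.mean = [0.0] * num_units
        self.std = [1.0] * num_units

    def update(self, batch_mean, batch_std):
        """Blend the latest mini-batch statistics into the running values."""
        a = self.alpha
        self.mean = [a * m + (1 - a) * b for m, b in zip(self.mean, batch_mean)]
        self.std = [a * s + (1 - a) * b for s, b in zip(self.std, batch_std)]

    def normalize(self, a_vec, gamma, beta, delta=1e-5):
        """Process a single new data point using the stored averages."""
        return [g * (x - m) / math.sqrt(s * s + delta) + b
                for x, m, s, g, b in zip(a_vec, self.mean, self.std, gamma, beta)]
```

During training, `update` would be called once per mini-batch with that batch's statistics, while `normalize` is used only for new data points during inference.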

Although batch normalization is very effective in practice, there is uncertainty 
as to why it works so well. Batch normalization was originally motivated by noting 
that updates to weights in earlier layers of the network change the distribution of 
values seen by later layers, a phenomenon called internal covariate shift. However, 
later studies (Santurkar et al., 2018) suggest that covariate shift is not a significant 
factor and that the improved training results from an improvement in the smoothness 
of the error function landscape. 


7.4.3 Layer normalization 


With batch normalization, if the batch size is too small then the estimates of the 
mean and variance become too noisy. Also, for very large training sets, the mini- 
batches may be split across different GPUs, making global normalization across the 
mini-batch inefficient. An alternative to normalizing across examples within a mini- 
batch for each hidden unit separately is to normalize across the hidden-unit values 
for each data point separately. This is known as layer normalization (Ba, Kiros, and 
Hinton, 2016). It was introduced in the context of recurrent neural networks where 
the distributions change after each time step making batch normalization infeasible. 
However, it is useful in other architectures such as transformer networks. 

By analogy with batch normalization, we therefore make the following transformation:

$$\mu_n = \frac{1}{M} \sum_{i=1}^{M} a_{ni} \tag{7.58}$$

$$\sigma_n^2 = \frac{1}{M} \sum_{i=1}^{M} (a_{ni} - \mu_n)^2 \tag{7.59}$$

$$\hat{a}_{ni} = \frac{a_{ni} - \mu_n}{\sqrt{\sigma_n^2 + \delta}} \tag{7.60}$$

where the sums over $i = 1, \ldots, M$ are taken over all hidden units in the layer. As with
batch normalization, additional learnable mean and standard deviation parameters
are introduced for each hidden unit separately in the form (7.55). Note that the same
normalization function can be employed during training and during inference, and
so there is no need to store moving averages. Layer normalization is compared with
batch normalization in Figure 7.8.
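
For comparison with batch normalization, a minimal sketch of layer normalization for a single data point is given below. The function name and the default `delta` are assumptions; the re-scaling by `gamma` and `beta` follows the form (7.55).

```python
import math

def layer_norm(a, gamma, beta, delta=1e-5):
    """Normalize one data point's pre-activations across its M hidden units.

    Implements (7.58) to (7.60), followed by a per-unit re-scaling as in (7.55).
    """
    M = len(a)
    mu = sum(a) / M                                # (7.58)
    var = sum((x - mu) ** 2 for x in a) / M        # (7.59)
    return [g * (x - mu) / math.sqrt(var + delta) + b   # (7.60), then re-scale
            for x, g, b in zip(a, gamma, beta)]
```

Because the statistics depend only on the data point being processed, the same function serves for both training and inference, with no stored averages.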


(⋆) By substituting (7.10) into (7.7) and using (7.8) and (7.9), show that the error
function (7.7) can be written in the form (7.11). 


(⋆) Consider a Hessian matrix $\mathbf{H}$ with eigenvector equation (7.8). By setting the
vector $\mathbf{v}$ in (7.14) equal to each of the eigenvectors $\mathbf{u}_i$ in turn, show that $\mathbf{H}$ is positive
definite if, and only if, all its eigenvalues are positive.


(⋆⋆) By considering the local Taylor expansion (7.7) of an error function about a
stationary point $\mathbf{w}^\star$, show that the necessary and sufficient condition for the stationary
point to be a local minimum of the error function is that the Hessian matrix $\mathbf{H}$,
defined by (7.5) with $\widehat{\mathbf{w}} = \mathbf{w}^\star$, is positive definite.


(⋆⋆) Consider a linear regression model with a single input variable $x$ and a single
output variable $y$ of the form

$$y(x, w, b) = wx + b \tag{7.61}$$

together with a sum-of-squares error function given by

$$E(w, b) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w, b) - t_n\}^2. \tag{7.62}$$

Derive expressions for the elements of the $2 \times 2$ Hessian matrix given by the second
derivatives of the error function with respect to the weight parameter $w$ and bias
parameter $b$. Show that the trace and the determinant of this Hessian are both positive.
Since the trace represents the sum of the eigenvalues and the determinant corresponds
to the product of the eigenvalues, then both eigenvalues are positive and hence the
stationary point of the error function is a minimum.


(⋆⋆) Consider a single-layer classification model with a single input variable $x$ and
a single output variable $y$ of the form

$$y(x, w, b) = \sigma(wx + b) \tag{7.63}$$

where $\sigma(\cdot)$ is the logistic sigmoid function defined by (5.42) together with a
cross-entropy error function given by

$$E(w, b) = -\sum_{n=1}^{N} \{t_n \ln y(x_n, w, b) + (1 - t_n) \ln(1 - y(x_n, w, b))\}. \tag{7.64}$$

Derive expressions for the elements of the $2 \times 2$ Hessian matrix given by the second
derivatives of the error function with respect to the weight parameter $w$ and bias
parameter $b$. Show that the trace and the determinant of this Hessian are both positive.
Since the trace represents the sum of the eigenvalues and the determinant corresponds
to the product of the eigenvalues, then both eigenvalues are positive and hence the
stationary point of the error function is a minimum.


(⋆⋆) Consider a quadratic error function defined by (7.7) in which the Hessian matrix
$\mathbf{H}$ has an eigenvalue equation given by (7.8). Show that the contours of constant
error are ellipses whose axes are aligned with the eigenvectors $\mathbf{u}_i$ with lengths that
are inversely proportional to the square roots of the corresponding eigenvalues $\lambda_i$.


(⋆) Show that, as a consequence of the symmetry of the Hessian matrix $\mathbf{H}$, the
number of independent elements in the quadratic error function (7.3) is given by 
W(W +3)/2. 


(⋆) Consider a set of values $x_1, \ldots, x_N$ drawn from a distribution with mean $\mu$ and
variance $\sigma^2$, and define the sample mean to be

$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n. \tag{7.65}$$

Show that the expectation of the squared error $(\bar{x} - \mu)^2$ with respect to the distribution
from which the data is drawn is given by $\sigma^2/N$. This shows that the RMS
error in the sample mean is given by $\sigma/\sqrt{N}$, which decreases relatively slowly as
the sample size $N$ increases.


(⋆⋆) Consider a layered network that computes the functions (7.19) and (7.20) in
layer $l$. Suppose we initialize the weights using a Gaussian $\mathcal{N}(0, \epsilon^2)$, and suppose
that the outputs $z_j^{(l-1)}$ of the units in layer $l - 1$ have variance $\lambda^2$. By using the form
of the ReLU activation function, show that the mean and variance of the outputs in
layer $l$ are given by (7.21) and (7.22), respectively. Hence, show that if we want
the units in layer $l$ also to have pre-activations with variance $\lambda^2$ then the value of $\epsilon$
should be given by (7.23).


(⋆⋆) By making use of (7.7), (7.8), and (7.10), derive the results (7.24) and (7.25),
which express the gradient vector and a general weight update, as expansions in the
eigenvectors of the Hessian matrix. Use these results, together with the eigenvector
orthonormality relation (7.9) and the batch gradient descent formula (7.16), to derive
the result (7.26) for the batch gradient descent update expressed in terms of the
coefficients $\{\alpha_i\}$.


(⋆) Consider a smoothly varying error surface with low curvature such that the gra-
dient varies only slowly with position. Show that, for small values of the learning 
rate and momentum parameters, the Nesterov momentum gradient update defined 
by (7.34) is equivalent to the standard gradient descent with momentum defined by 
(7.31). 


(⋆⋆) Consider a sequence of values $\{x_1, \ldots, x_N\}$ of some variable $x$, and suppose
we compute an exponentially weighted moving average using the formula

$$\mu_n = \beta \mu_{n-1} + (1 - \beta) x_n \tag{7.66}$$

where $0 < \beta < 1$. By making use of the following result for the sum of a finite
geometric series

$$\sum_{k=1}^{n} \beta^{k-1} = \frac{1 - \beta^n}{1 - \beta} \tag{7.67}$$

show that if the sequence of averages is initialized using $\mu_0 = 0$, then the estimators
are biased and that the bias can be corrected using

$$\hat{\mu}_n = \frac{\mu_n}{1 - \beta^n}. \tag{7.68}$$


(⋆) In gradient descent, the weight vector $\mathbf{w}$ is updated by taking a step in weight
space in the direction of the negative gradient governed by a learning rate parameter
$\eta$. Suppose instead that we choose a direction $\mathbf{d}$ in weight space along which we
minimize the error function, given the current weight vector $\mathbf{w}^{(\tau)}$. This involves
minimizing the quantity

$$E(\mathbf{w}^{(\tau)} + \lambda \mathbf{d}) \tag{7.69}$$

as a function of $\lambda$ to give a value $\lambda^\star$ corresponding to a new weight vector $\mathbf{w}^{(\tau+1)}$.
Show that the gradient of $E(\mathbf{w})$ at $\mathbf{w}^{(\tau+1)}$ is orthogonal to the vector $\mathbf{d}$. This is
known as a ‘line search’ method and it forms the basis for a variety of numerical
optimization algorithms (Bishop, 1995b).


(⋆) Show that the renormalized input variables defined by (7.50), where $\mu_i$ is defined
by (7.48) and $\sigma_i^2$ is defined by (7.49), have zero mean and unit variance.


Backpropagation 


Our goal in this chapter is to find an efficient technique for evaluating the gradient 
of an error function E(w) for a feed-forward neural network. We will see that this 
can be achieved using a local message-passing scheme in which information is sent 
backwards through the network and is known as error backpropagation, or some- 
times simply as backprop. 

Historically, the backpropagation equations would have been derived by hand 
and then implemented in software alongside the forward propagation equations, with 
both steps taking time and being prone to mistakes. Modern neural network software 
environments, however, allow virtually any derivatives of interest to be calculated 
efficiently with only minimal effort beyond that of coding up the original network 
function. This idea, called automatic differentiation, plays a key role in modern deep 
learning. However, it is valuable to understand how the calculations are performed 
so that we are not relying on ‘black box’ software solutions. In this chapter we
therefore explain the key concepts of backpropagation, and explore the framework
of automatic differentiation in detail.

Note that the term ‘backpropagation’ is used in the neural computing literature
in a variety of different ways. For instance, a feed-forward architecture may be
called a backpropagation network. Also the term ‘backpropagation’ is sometimes
used to describe the end-to-end training procedure for a neural network including
the gradient descent parameter updates. In this book we will use ‘backpropagation’
specifically to describe the computational procedure used in the numerical evaluation
of derivatives such as the gradient of the error function with respect to the weights
and biases of a network. This procedure can also be applied to the evaluation of other
important derivatives such as the Jacobian and Hessian matrices.


8.1 Evaluation of Gradients


We now derive the backpropagation algorithm for a general network having arbitrary 
feed-forward topology, arbitrary differentiable nonlinear activation functions, and a 
broad class of error function. The resulting formulae will then be illustrated using 
a simple layered network structure having a single layer of sigmoidal hidden units 
together with a sum-of-squares error. 

Many error functions of practical interest, for instance those defined by maximum
likelihood for a set of i.i.d. data, comprise a sum of terms, one for each data
point in the training set, so that

$$E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}). \tag{8.1}$$

Here we will consider the problem of evaluating $\nabla E_n(\mathbf{w})$ for one such term in the
error function. This may be used directly for stochastic gradient descent, or the
results could be accumulated over a set of training data points for batch or mini-batch
methods.


8.1.1 Single-layer networks 


Consider first a simple linear model in which the outputs $y_k$ are linear combinations
of the input variables $x_i$ so that

$$y_k = \sum_i w_{ki} x_i \tag{8.2}$$

together with a sum-of-squares error function that, for a particular input data point
$n$, takes the form

$$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2 \tag{8.3}$$

where $y_{nk} = y_k(\mathbf{x}_n, \mathbf{w})$, and $t_{nk}$ is the associated target value. The gradient of this
error function with respect to a weight $w_{ji}$ is given by

$$\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}. \tag{8.4}$$

This can be interpreted as a ‘local’ computation involving the product of an ‘error
signal’ $y_{nj} - t_{nj}$ associated with the output end of the link $w_{ji}$ and the variable $x_{ni}$
associated with the input end of the link. In Section 5.4.3, we saw how a similar
formula arises with the logistic-sigmoid activation function together with the
cross-entropy error function and similarly for the softmax activation function together with
its matching multivariate cross-entropy error function. We will now see how this simple
result extends to the more complex setting of multilayer feed-forward networks.
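
The ‘local’ structure of (8.4) is visible directly in code. In the sketch below, whose function names are hypothetical, the weights are stored as a list of rows `W[k][i]` and the gradient for a single data point is the outer product of the output errors with the inputs.

```python
def linear_outputs(W, x):
    """Outputs of the linear model (8.2): one weighted sum per row of W."""
    return [sum(w_ki * x_i for w_ki, x_i in zip(row, x)) for row in W]

def sum_of_squares_grad(W, x, t):
    """Gradient (8.4) of the single-point error (8.3) with respect to every weight."""
    y = linear_outputs(W, x)
    # Entry [j][i] is the error signal at output j times the input x_i.
    return [[(y_j - t_j) * x_i for x_i in x] for y_j, t_j in zip(y, t)]

grad = sum_of_squares_grad([[1.0, 2.0], [0.0, 1.0]], x=[1.0, 2.0], t=[4.0, 0.0])
```

For these example values the outputs are 5 and 2, the error signals are 1 and 2, and so `grad` equals `[[1.0, 2.0], [2.0, 4.0]]`.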


8.1.2 General feed-forward networks 


In general, a feed-forward network consists of a set of units each of which computes
a weighted sum of its inputs:

$$a_j = \sum_i w_{ji} z_i \tag{8.5}$$

where $z_i$ is either the activation of another unit or an input unit that sends a connection
to unit $j$, and $w_{ji}$ is the weight associated with that connection. Biases can be
included in this sum by introducing an extra unit, or input, with activation fixed at
+1, and so we do not need to deal with biases explicitly. The sum in (8.5), known
as a pre-activation, is transformed by a nonlinear activation function $h(\cdot)$ to give the
activation $z_j$ of unit $j$ in the form

$$z_j = h(a_j). \tag{8.6}$$

Note that one or more of the variables $z_i$ in the sum in (8.5) could be an input, and
similarly, the unit $j$ in (8.6) could be an output.

For each data point in the training set, we will suppose that we have supplied the 
corresponding input vector to the network and calculated the activations of all the 
hidden and output units in the network by successive application of (8.5) and (8.6). 
This process is called forward propagation because it can be regarded as a forward 
flow of information through the network. 

Now consider the evaluation of the derivative of $E_n$ with respect to a weight
$w_{ji}$. The outputs of the various units will depend on the particular input data point
$n$. However, to keep the notation uncluttered, we will omit the subscript $n$ from the
network variables. First note that $E_n$ depends on the weight $w_{ji}$ only via the summed
input $a_j$ to unit $j$. We can therefore apply the chain rule for partial derivatives to give

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}. \tag{8.7}$$

We now introduce a useful notation:

$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} \tag{8.8}$$

where the $\delta$’s are often referred to as errors for reasons we will see shortly. Using
(8.5), we can write

$$\frac{\partial a_j}{\partial w_{ji}} = z_i. \tag{8.9}$$

Substituting (8.8) and (8.9) into (8.7), we then obtain

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i. \tag{8.10}$$

Equation (8.10) tells us that the required derivative is obtained simply by multiplying
the value of $\delta$ for the unit at the output end of the weight by the value of $z$ for the
unit at the input end of the weight (where $z = 1$ for a bias). Note that this takes the
same form as that found for the simple linear model in (8.4). Thus, to evaluate the
derivatives, we need calculate only the value of $\delta_j$ for each hidden and output unit in
the network and then apply (8.10).

As we have seen already, for the output units, we have

$$\delta_k = y_k - t_k \tag{8.11}$$

provided we are using the canonical link as the output-unit activation function
(Section 5.4.6). To evaluate the $\delta$’s for hidden units, we again make use of the chain
rule for partial derivatives:

$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \tag{8.12}$$

where the sum runs over all units $k$ to which unit $j$ sends connections. The arrangement
of units and weights is illustrated in Figure 8.1. Note that the units labelled $k$
include other hidden units and/or output units. In writing down (8.12), we are making
use of the fact that variations in $a_j$ give rise to variations in the error function
only through variations in the variables $a_k$.

Figure 8.1 Illustration of the calculation of $\delta_j$ for hidden unit $j$ by backpropagation of the $\delta$’s
from those units $k$ to which unit $j$ sends connections. The black arrows denote the direction of
information flow during forward propagation, and the red arrows indicate the backward propagation
of error information.

If we now substitute the definition of $\delta_j$ given by (8.8) into (8.12) and make use
of (8.5) and (8.6), we obtain the following backpropagation formula (Exercise 8.1):


$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k \tag{8.13}$$

which tells us that the value of $\delta$ for a particular hidden unit can be obtained by
propagating the $\delta$’s backwards from units higher up in the network, as illustrated
in Figure 8.1. Note that the summation in (8.13) is taken over the first index on
$w_{kj}$ (corresponding to backward propagation of information through the network),
whereas in the forward propagation equation (8.5), it is taken over the second index
(Exercise 8.2). Because we already know the values of the $\delta$’s for the output units, it
follows that by recursively applying (8.13), we can evaluate the $\delta$’s for all the hidden
units in a feed-forward network, regardless of its topology. The backpropagation
procedure is summarized in Algorithm 8.1.

Algorithm 8.1: Backpropagation

Input: Input vector $\mathbf{x}_n$
       Network parameters $\mathbf{w}$
       Error function $E_n(\mathbf{w})$ for input $\mathbf{x}_n$
       Activation function $h(a)$

Output: Error function derivatives $\{\partial E_n / \partial w_{ji}\}$

// Forward propagation
for $j \in$ all hidden and output units do
    $a_j \leftarrow \sum_i w_{ji} z_i$ // $\{z_i\}$ includes inputs $\{x_i\}$
    $z_j \leftarrow h(a_j)$ // activation function
end for

// Error evaluation
for $k \in$ all output units do
    $\delta_k \leftarrow \partial E_n / \partial a_k$ // compute errors
end for

// Backward propagation, in reverse order
for $j \in$ all hidden units do
    $\delta_j \leftarrow h'(a_j) \sum_k w_{kj} \delta_k$ // recursive backward evaluation
    $\partial E_n / \partial w_{ji} \leftarrow \delta_j z_i$ // evaluate derivatives
end for

return $\{\partial E_n / \partial w_{ji}\}$

For batch methods, the derivative of the total error $E$ can then be obtained by
repeating the above steps for each data point in the training set and then summing
over all data points in the batch or mini-batch:

$$\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}. \tag{8.14}$$

In the above derivation we have implicitly assumed that each hidden or output unit in
the network has the same activation function $h(\cdot)$. However, the derivation is easily
generalized to allow different units to have individual activation functions, simply
by keeping track of which form of $h(\cdot)$ goes with which unit.


8.1.3 A simple example 


The above derivation of the backpropagation procedure allowed for general 
forms for the error function, the activation functions, and the network topology. To 
illustrate the application of this algorithm, we consider a two-layer network of the 
form illustrated in Figure 6.9, together with a sum-of-squares error. The output units 
have linear activation functions, so that $y_k = a_k$, and the hidden units have sigmoidal
activation functions given by

$$h(a) = \tanh(a) \tag{8.15}$$

where $\tanh(a)$ is defined by (6.14). A useful feature of this function is that its
derivative can be expressed in a particularly simple form:

$$h'(a) = 1 - h(a)^2. \tag{8.16}$$

We also consider a sum-of-squares error function, so that for data point $n$ the error
is given by

$$E_n = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2 \tag{8.17}$$

where $y_k$ is the activation of output unit $k$, and $t_k$ is the corresponding target value
for a particular input vector $\mathbf{x}_n$.

For each data point in the training set in turn, we first perform a forward propagation
using

$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i \tag{8.18}$$

$$z_j = \tanh(a_j) \tag{8.19}$$

$$y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j \tag{8.20}$$

where $D$ is the dimensionality of the input vector $\mathbf{x}$ and $M$ is the total number of
hidden units. Also we have used $x_0 = z_0 = 1$ to allow bias parameters to be included
in the weights. Next we compute the $\delta$’s for each output unit using

$$\delta_k = y_k - t_k. \tag{8.21}$$

Then, we backpropagate these errors to obtain $\delta$’s for the hidden units using

$$\delta_j = (1 - z_j^2) \sum_{k=1}^{K} w_{kj}^{(2)} \delta_k \tag{8.22}$$

which follows from (8.13) and (8.16). Finally, the derivatives with respect to the
first-layer and second-layer weights are given by

$$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j. \tag{8.23}$$

8.1.4 Numerical differentiation 


One of the most important aspects of backpropagation is its computational effi- 
ciency. To understand this, let us examine how the number of compute operations 
required to evaluate the derivatives of the error function scales with the total number 
W of weights and biases in the network. 

A single evaluation of the error function (for a given input data point) would 
require O(W) operations, for sufficiently large W. This follows because, except for 
a network with very sparse connections, the number of weights is typically much 
greater than the number of units, and so the bulk of the computational effort in for- 
ward propagation arises from evaluation of the sums in (8.5), with the evaluation of 
the activation functions representing a small overhead. Each term in the sum in (8.5) 
requires one multiplication and one addition, leading to an overall computational 
cost that is O(W). 

An alternative approach to backpropagation for computing the derivatives of the
error function is to use finite differences. This can be done by perturbing each weight
in turn and approximating the derivatives by using the expression

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon) \tag{8.24}$$

where $\epsilon \ll 1$. In a software simulation, the accuracy of the approximation to the
derivatives can be improved by making $\epsilon$ smaller, until numerical round-off problems
arise. The accuracy of the finite differences method can be improved significantly
by using symmetrical central differences of the form

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2). \tag{8.25}$$

In this case, the $O(\epsilon)$ corrections cancel, as can be verified by a Taylor expansion
of the right-hand side of (8.25), and so the residual corrections are $O(\epsilon^2)$. Note,
however, that the number of computational steps is roughly doubled compared with
(8.24). Figure 8.2 shows a plot of the error between a numerical evaluation of a
gradient using both finite differences (8.24) and central differences (8.25) versus the
analytical result, as a function of the value of the step size $\epsilon$.

The main problem with numerical differentiation is that the highly desirable
$O(W)$ scaling has been lost. Each forward propagation requires $O(W)$ steps, and
there are $W$ weights in the network each of which must be perturbed individually, so
that the overall computational cost is $O(W^2)$.

However, numerical differentiation can play a useful role in practice, because 
a comparison of the derivatives calculated from a direct implementation of back- 
propagation, or from automatic differentiation, with those obtained using central 
differences provides a powerful check on the correctness of the software. 
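
The difference in accuracy between (8.24) and (8.25) can be seen on a one-weight toy problem. The error function below, the squared error of $\tanh(w)$ against a target of 0.5, is a hypothetical choice made only because its exact derivative is easy to write down.

```python
import math

def forward_difference(f, w, eps):
    """One-sided estimate of df/dw as in (8.24); the error is O(eps)."""
    return (f(w + eps) - f(w)) / eps

def central_difference(f, w, eps):
    """Symmetric estimate of df/dw as in (8.25); the error is O(eps**2)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def toy_error(w):
    """Hypothetical one-weight error: squared error of tanh(w) against target 0.5."""
    return 0.5 * (math.tanh(w) - 0.5) ** 2

def toy_error_grad(w):
    """Exact derivative of toy_error, playing the role of the backpropagation result."""
    y = math.tanh(w)
    return (y - 0.5) * (1.0 - y * y)

w0, eps = 0.3, 1e-3
exact = toy_error_grad(w0)
err_forward = abs(forward_difference(toy_error, w0, eps) - exact)
err_central = abs(central_difference(toy_error, w0, eps) - exact)
```

With this step size the central-difference error is several orders of magnitude smaller than the one-sided error, at the cost of two function evaluations per weight instead of one.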

Figure 8.2 The red curve shows a plot of the error between the numerical evaluation of a gradient
using finite differences (8.24) and the analytical result, as a function of $\epsilon$. As $\epsilon$ decreases, the
plot initially shows a linear decrease in error, and this represents a power law behaviour since the
axes are logarithmic. The slope of this line is 1, which shows that this error behaves like $O(\epsilon)$. At
some point the evaluated gradient reaches the limit of numerical round-off and further reduction
in $\epsilon$ leads to a noisy line, which again follows a power law but where the error now increases with
decreasing $\epsilon$. The blue curve shows the corresponding result for central differences (8.25). We
see a much smaller error compared to finite differences, and the slope of the line is 2, which shows
that the error is $O(\epsilon^2)$.
[Plot: error (logarithmic axis) against $\epsilon$ (logarithmic axis), with curves labelled ‘finite differences’ and ‘central differences’.]


8.1.5 The Jacobian matrix 


We have seen how the derivatives of an error function with respect to the weights
can be obtained by propagating errors backwards through the network. Backpropagation
can also be used to calculate other derivatives. Here we consider the evaluation
of the Jacobian matrix, whose elements are given by the derivatives of the
network outputs with respect to the inputs:

$$J_{ki} = \frac{\partial y_k}{\partial x_i} \tag{8.26}$$


where each such derivative is evaluated with all other inputs held fixed. Jacobian 
matrices play a useful role in systems built from a number of distinct modules, as 
illustrated in Figure 8.3. Each module can comprise a fixed or learnable function, 
which can be linear or nonlinear, so long as it is differentiable. 

Suppose we wish to minimize an error function $E$ with respect to the parameter
$w$ in Figure 8.3. The derivative of the error function is given by

$$\frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_j} \frac{\partial z_j}{\partial w} \tag{8.27}$$

in which the Jacobian matrix for the red module in Figure 8.3 appears as the middle
term on the right-hand side.

Because the Jacobian matrix provides a measure of the local sensitivity of the
outputs to changes in each of the input variables, it also allows any known errors $\Delta x_i$
associated with the inputs to be propagated through the trained network to estimate
their contribution $\Delta y_k$ to the errors at the outputs, through the relation

$$\Delta y_k \simeq \sum_i \frac{\partial y_k}{\partial x_i} \Delta x_i, \tag{8.28}$$

which assumes that the $|\Delta x_i|$ are small. In general, the network mapping represented
by a trained neural network will be nonlinear, and so the elements of the Jacobian
matrix will not be constants but will depend on the particular input vector used. Thus,
(8.28) is valid only for small perturbations of the inputs, and the Jacobian itself must
be re-evaluated for each new input vector.

Figure 8.3 Illustration of a modular deep learning architecture in which the Jacobian matrix can be
used to backpropagate error signals from the outputs through to earlier modules in the system.

The Jacobian matrix can be evaluated using a backpropagation procedure that is 
like the one derived earlier for evaluating the derivatives of an error function with 
respect to the weights. We start by writing the element Jg; in the form 


ee OYR y yk Oa; 
J 


Ox; Oa; Ox; 
= yi” (8.29) 
F Oa; 


where we have made use of (8.5). The sum in (8.29) runs over all units 7 to which 
the input unit 7 sends connections (for example, over all units in the first hidden 
layer in the layered topology considered earlier). We now write down a recursive 
backpropagation formula for the derivatives Oy; /Oa,;: 


OYK T yk Oa 
ða; 2 Oa; Oa; 


OYk 
= h'(a;) a Wii Fa (8.30) 


where the sum runs over all units / to which unit 7 sends connections (corresponding 
to the first index of w;,;). Again, we have made use of (8.5) and (8.6). This back- 
propagation starts at the output units, for which the required derivatives can be found 
directly from the functional form of the output-unit activation function. For linear 
output units, we have 

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl} \tag{8.31}$$

where $\delta_{kl}$ are the elements of the identity matrix and are defined by

$$\delta_{kl} = \begin{cases} 1, & \text{if } k = l, \\ 0, & \text{otherwise.} \end{cases} \tag{8.32}$$

If we have individual logistic sigmoid activation functions at each output unit, then

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl}\,\sigma'(a_l) \tag{8.33}$$

whereas for softmax outputs, we have

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl}\,y_k - y_k y_l. \tag{8.34}$$


We can summarize the procedure for calculating the Jacobian matrix as follows. 
Apply the input vector corresponding to the point in input space at which the Jaco- 
bian matrix is to be evaluated, and forward propagate in the usual way to obtain the 
states of all the hidden and output units in the network. Next, for each row k of the 
Jacobian matrix, corresponding to the output unit k, backpropagate using the recur- 
sive relation (8.30), starting with (8.31), (8.33) or (8.34), for all the hidden units in 
the network. Finally, use (8.29) for the backpropagation to the inputs. The Jacobian 
can also be evaluated using an alternative forward propagation formalism, which can 
be derived in an analogous way to the backpropagation approach given here. 

Again, the implementation of such algorithms can be checked using numerical 
differentiation in the form 


$$\frac{\partial y_k}{\partial x_i} = \frac{y_k(x_i + \epsilon) - y_k(x_i - \epsilon)}{2\epsilon} + O(\epsilon^2), \tag{8.35}$$

which involves $2D$ forward propagation passes for a network having $D$ inputs and
therefore requires $O(DW)$ steps in total.
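The procedure summarized above, together with the numerical check (8.35), can be sketched for a small network. This is an illustrative sketch only: the two-layer network with tanh hidden units and linear outputs, the weight values, and the function names `forward`, `jacobian_backprop` and `jacobian_numerical` are all assumptions introduced here.

```python
import math

# Hypothetical network: D=3 inputs, M=2 tanh hidden units, K=2 linear outputs.
W1 = [[0.5, -0.3, 0.8], [0.1, 0.7, -0.6]]   # W1[j][i]: input i -> hidden unit j
B1 = [0.1, -0.2]
W2 = [[1.2, -0.4], [0.3, 0.9]]              # W2[k][j]: hidden unit j -> output k
B2 = [0.0, 0.5]

def forward(x):
    """Forward propagation; returns hidden pre-activations, hidden outputs, outputs."""
    a = [sum(W1[j][i] * x[i] for i in range(len(x))) + B1[j] for j in range(len(W1))]
    z = [math.tanh(aj) for aj in a]
    y = [sum(W2[k][j] * z[j] for j in range(len(z))) + B2[k] for k in range(len(W2))]
    return a, z, y

def jacobian_backprop(x):
    """Jacobian J[k][i] = dy_k/dx_i, one backward pass per output unit k."""
    _, z, _ = forward(x)
    K, M, D = len(W2), len(W1), len(x)
    J = []
    for k in range(K):
        # (8.31): linear output units give dy_k/da_l = delta_kl.
        d_out = [1.0 if l == k else 0.0 for l in range(K)]
        # (8.30): dy_k/da_j = h'(a_j) * sum_l w_lj * dy_k/da_l, with h = tanh.
        d_hid = [(1.0 - z[j] ** 2) * sum(W2[l][j] * d_out[l] for l in range(K))
                 for j in range(M)]
        # (8.29): J_ki = sum_j w_ji * dy_k/da_j.
        J.append([sum(W1[j][i] * d_hid[j] for j in range(M)) for i in range(D)])
    return J

def jacobian_numerical(x, eps=1e-6):
    """Central differences (8.35): 2D forward passes for D inputs."""
    D = len(x)
    cols = []
    for i in range(D):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        yp, ym = forward(xp)[2], forward(xm)[2]
        cols.append([(p - m) / (2.0 * eps) for p, m in zip(yp, ym)])
    return [[cols[i][k] for i in range(D)] for k in range(len(cols[0]))]

x0 = [0.2, -0.5, 1.0]
J_bp = jacobian_backprop(x0)
J_fd = jacobian_numerical(x0)
```

The two results should agree to within the accuracy of the finite-difference estimate, which is the kind of check the text recommends for any hand-written backpropagation code.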


8.1.6 The Hessian matrix 


We have shown how backpropagation can be used to obtain the first derivatives 
of an error function with respect to the weights in the network. Backpropagation can 
also be used to evaluate the second derivatives of the error, which are given by 


$$\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}. \tag{8.36}$$

It is often convenient to consider all the weight and bias parameters as elements $w_i$
of a single vector, denoted $\mathbf{w}$, in which case the second derivatives form the
elements $H_{ij}$ of the Hessian matrix $\mathbf{H}$:

$$H_{ij} = \frac{\partial^2 E}{\partial w_i\,\partial w_j} \tag{8.37}$$

where $i, j \in \{1, \ldots, W\}$ and $W$ is the total number of weights and biases. The
Hessian matrix arises in several nonlinear optimization algorithms used for training 
neural networks based on considerations of the second-order properties of the error 
surface (Bishop, 2006). It also plays a role in some Bayesian treatments of neural 
networks (MacKay, 1992; Bishop, 2006) and has been used to reduce the precision 
of the weights in large language models to lessen their memory footprint (Shen et al., 
2019). 

An important consideration for many applications of the Hessian is the efficiency 
with which it can be evaluated. If there are W parameters (weights and biases) in 
the network, then the Hessian matrix has dimensions $W \times W$ and so the
computational effort needed to evaluate the Hessian will scale like $O(W^2)$ for each point
in the data set. Extension of the backpropagation procedure (Bishop, 1992) allows
the Hessian matrix to be evaluated efficiently with a scaling that is indeed $O(W^2)$.
Sometimes, we do not need the Hessian matrix explicitly but only the product $\mathbf{v}^{\mathrm{T}}\mathbf{H}$
of the Hessian with some vector v, and this product can be calculated efficiently 
in O(W) steps using an extension of backpropagation (Møller, 1993; Pearlmutter, 
1994). 

Since neural networks may contain millions or even billions of parameters, eval- 
uating, or even just storing, the full Hessian matrix for many models is infeasible. 
Evaluating the inverse of the Hessian is even more demanding as this has $O(W^3)$
computational scaling. Consequently there is interest in finding effective approxi- 
mations to the full Hessian. 

One approximation involves simply evaluating only the diagonal elements of 
the Hessian and implicitly setting the off-diagonal elements to zero. This requires 
O(W) storage and allows the inverse to be evaluated in O(W) steps but still requires 
$O(W^2)$ computation (Ricotti, Ragazzini, and Martinelli, 1988), although with
further approximation this can be reduced to O(W) steps (Becker and LeCun, 1989;
LeCun, Denker, and Solla, 1990). In practice, however, the Hessian generally has 
significant off-diagonal terms, and so this approximation must be treated with care. 

A more convincing approach, known as the outer product approximation, is 
obtained as follows. Consider a regression application using a sum-of-squares error 
function of the form 


$$E = \frac{1}{2}\sum_{n=1}^{N}(y_n - t_n)^2 \tag{8.38}$$

where we have considered a single output to keep the notation simple (the extension
to several outputs is straightforward). We can then write the Hessian matrix in the
form

$$\mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N}\nabla y_n(\nabla y_n)^{\mathrm{T}} + \sum_{n=1}^{N}(y_n - t_n)\nabla\nabla y_n \tag{8.39}$$


where $\nabla$ denotes the gradient with respect to $\mathbf{w}$. If the network has been
trained on the data set and its outputs $y_n$ are very close to the target values $t_n$,
then the final term in (8.39) will be small and can be neglected. More generally, however,
it may be appropriate to neglect this term based on the following argument. Recall
from Section 4.2 that the optimal function that minimizes a sum-of-squares loss is
the conditional average of the target data. The quantity $(y_n - t_n)$ is then a random
variable with zero mean. If we assume that its value is uncorrelated with the value
of the second derivative term on the right-hand side of (8.39), then the whole term
will average to zero in the summation over $n$.

By neglecting the second term in (8.39), we arrive at the Levenberg–Marquardt
approximation, also known as the outer product approximation because the Hessian
matrix is built up from a sum of outer products of vectors, given by

$$\mathbf{H} \simeq \sum_{n=1}^{N}\nabla a_n \nabla a_n^{\mathrm{T}}. \tag{8.40}$$

Here $a_n$ denotes the pre-activation of the output unit for data point $n$. For the
linear output unit used in regression, $y_n = a_n$ and so $\nabla y_n = \nabla a_n$.
Evaluating the outer product approximation for the Hessian is straightforward as it
involves only first derivatives of the error function, which can be evaluated efficiently
in O(W) steps using standard backpropagation. The elements of the matrix can then
be found in $O(W^2)$ steps by simple multiplication. It is important to emphasize
that this approximation is likely to be valid only for a network that has been trained 
appropriately, and that for a general network mapping, the second derivative terms 
on the right-hand side of (8.39) will typically not be negligible. 
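As a rough numerical illustration of (8.39) and (8.40), the sketch below builds the outer product approximation for a hypothetical one-hidden-unit model with a linear output and compares it with a finite-difference estimate of the exact Hessian. The model, the data and all function names are assumptions made for this example. The targets are generated by the model itself, so the residuals $(y_n - t_n)$ vanish at the chosen weights and the second term in (8.39) is exactly zero.

```python
import math

def model(w, x):
    """Hypothetical model with a linear output unit: y = w2 * tanh(w1 * x)."""
    return w[1] * math.tanh(w[0] * x)

def grad_model(w, x):
    """Hand-derived gradient of y with respect to (w1, w2)."""
    t = math.tanh(w[0] * x)
    return [w[1] * (1.0 - t * t) * x, t]

def error(w, xs, ts):
    """Sum-of-squares error (8.38)."""
    return 0.5 * sum((model(w, x) - t) ** 2 for x, t in zip(xs, ts))

def outer_product_hessian(w, xs):
    """Outer product approximation (8.40), using only first derivatives."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        g = grad_model(w, x)
        for i in range(2):
            for j in range(2):
                H[i][j] += g[i] * g[j]
    return H

def numerical_hessian(w, xs, ts, h=1e-4):
    """Finite-difference estimate of the exact Hessian of the error."""
    def shifted(i, j, si, sj):
        v = list(w)
        v[i] += si * h
        v[j] += sj * h
        return error(v, xs, ts)
    return [[(shifted(i, j, 1, 1) - shifted(i, j, 1, -1)
              - shifted(i, j, -1, 1) + shifted(i, j, -1, -1)) / (4.0 * h * h)
             for j in range(2)] for i in range(2)]

xs = [-1.0, -0.5, 0.3, 0.8, 1.5]
w_star = [0.7, -1.3]
ts = [model(w_star, x) for x in xs]   # targets fitted exactly by w_star

H_op = outer_product_hessian(w_star, xs)
H_fd = numerical_hessian(w_star, xs, ts)
```

With noisy targets, or at weights far from a minimum, the two matrices would generally differ, which is the caveat raised in the text.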

For a cross-entropy error function for a network with logistic-sigmoid output- 
unit activation functions, the corresponding approximation is given by 


$$\mathbf{H} \simeq \sum_{n=1}^{N} y_n(1 - y_n)\nabla a_n \nabla a_n^{\mathrm{T}}. \tag{8.41}$$


An analogous result can be obtained for multi-class networks having softmax output- 
unit activation functions. The outer product approximation can also be used to de- 
velop an efficient sequential procedure for approximating the inverse of a Hessian 
(Hassibi and Stork, 1993). 


Automatic Differentiation 


We have seen the importance of using gradient information to train neural networks 
efficiently. There are essentially four ways in which the gradient of a neural network 
error function can be evaluated. 

The first approach, which formed the mainstay of neural networks for many 
years, is to derive the backpropagation equations by hand and then to implement 
them explicitly in software. If this is done carefully it results in efficient code that 
gives precise results that are accurate to numerical precision. However, the process 
of deriving the equations as well as the process of coding them both take time and are 
prone to errors. It also results in some redundancy in the code because the forward 
propagation equations are coded separately from the backpropagation equations.
Because these often involve duplicated calculations, if the model is altered, both the
forward and backward implementations need to be changed in unison. This effort
can easily become a limitation on how quickly and effectively different architectures
can be explored empirically. 

A second approach is to evaluate the gradients numerically using finite differ- 
ences. This requires only a software implementation of the forward propagation 
equations. One problem with numerical differentiation is that it has limited com- 
putational accuracy, although this is unlikely to be an issue for network training as 
we may be using stochastic gradient descent in which each evaluation is only a very 
noisy estimate of the local gradient. The main drawback of this approach is that it 
scales poorly with the size of the network. However, the technique is useful for de- 
bugging other approaches, because the gradients are evaluated using only the forward 
propagation code and so can be used to confirm the correctness of backpropagation 
or other code used to evaluate gradients. 

A third approach is called symbolic differentiation and makes use of specialist 
software to automate the analytical manipulations that are done by hand in the first 
approach. This process is an example of computer algebra or symbolic computation 
and involves the automatic application of the rules of calculus, such as the chain 
rule, in a completely mechanistic process. The resulting expressions are then imple- 
mented in standard software. An obvious advantage of this approach is that it avoids 
human error in the manual derivation of the backpropagation equations. Moreover, 
the gradients are again calculated to machine precision, and the poor scaling seen 
with numerical differentiation is avoided. The major downside of symbolic differ- 
entiation, however, is that the resulting expressions for derivatives can become ex- 
ponentially longer than the original function, with correspondingly long evaluation 
times. Consider a function f(x) given by the product of u(x) and v(x). The function 
and its derivative are given by 


$$f(x) = u(x)v(x) \tag{8.42}$$
$$f'(x) = u'(x)v(x) + u(x)v'(x). \tag{8.43}$$


We see that there is redundant computation in that u(x) and v(x) must be evalu- 
ated both for the calculation of f(x) and for f'(x). If the factors u(x) and v(x) 
themselves involve factors, then we end up with a nested duplication of expressions, 
which rapidly grow in complexity. This problem is called expression swell. 

As a further illustration, consider a function that is structured like two layers of 
a neural network (Grosse, 2018) with a single input x, a hidden unit with activation 
z, and an output y in which 


$$z = h(w_1 x + b_1) \tag{8.44}$$
$$y = h(w_2 z + b_2) \tag{8.45}$$

where $h(a)$ is the soft ReLU:

$$\zeta(a) = \ln(1 + \exp(a)). \tag{8.46}$$


The overall function is therefore given by 


$$y(x) = h(w_2 h(w_1 x + b_1) + b_2) \tag{8.47}$$

and the derivative of the network output with respect to $w_1$, evaluated symbolically,
is given by

$$\frac{\partial y}{\partial w_1} = \frac{w_2 x \exp\left(w_1 x + b_1 + b_2 + w_2 \ln\left[1 + e^{w_1 x + b_1}\right]\right)}{\left(1 + e^{w_1 x + b_1}\right)\left(1 + \exp\left(b_2 + w_2 \ln\left[1 + e^{w_1 x + b_1}\right]\right)\right)}. \tag{8.48}$$

As well as being significantly more complex than the original function, this expression
contains redundant computation: sub-expressions such as $w_1 x + b_1$ occur in several places.
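The redundancy can be made concrete with a short sketch. Below, `dy_dw1_symbolic` transcribes (8.48) literally, while `dy_dw1_reuse` evaluates the same derivative by the chain rule using intermediate variables that are each computed once. The function names and the numerical values are assumptions chosen for illustration.

```python
import math

def softplus(a):
    """Soft ReLU (8.46)."""
    return math.log1p(math.exp(a))

def sigmoid(a):
    """Derivative of the soft ReLU."""
    return 1.0 / (1.0 + math.exp(-a))

def y(x, w1, b1, w2, b2):
    """Two-layer function (8.47)."""
    return softplus(w2 * softplus(w1 * x + b1) + b2)

def dy_dw1_symbolic(x, w1, b1, w2, b2):
    """Literal transcription of (8.48); w1*x + b1 is recomputed four times."""
    num = w2 * x * math.exp(
        w1 * x + b1 + b2 + w2 * math.log(1.0 + math.exp(w1 * x + b1)))
    den = (1.0 + math.exp(w1 * x + b1)) * (
        1.0 + math.exp(b2 + w2 * math.log(1.0 + math.exp(w1 * x + b1))))
    return num / den

def dy_dw1_reuse(x, w1, b1, w2, b2):
    """Same derivative via the chain rule, reusing intermediate variables."""
    a1 = w1 * x + b1
    z = softplus(a1)
    a2 = w2 * z + b2
    return sigmoid(a2) * w2 * sigmoid(a1) * x

args = (0.5, 1.2, -0.3, 0.8, 0.1)   # x, w1, b1, w2, b2 (arbitrary values)
d_symbolic = dy_dw1_symbolic(*args)
d_reuse = dy_dw1_reuse(*args)

# Central-difference check with respect to w1.
h = 1e-6
x, w1, b1, w2, b2 = args
d_numeric = (y(x, w1 + h, b1, w2, b2) - y(x, w1 - h, b1, w2, b2)) / (2.0 * h)
```

All three values should coincide up to rounding and finite-difference error. The version with stored intermediates is essentially what automatic differentiation generates.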

A further major drawback with symbolic differentiation is that it requires that 
the expression to be differentiated is expressed in closed form. It therefore excludes 
important control flow operations such as loops, recursions, conditional execution, 
and procedure calls, which are valuable constructs that we might wish to use when 
defining the network function. 

We therefore turn to the fourth technique for evaluating derivatives in neural 
networks called automatic differentiation, also known as ‘autodiff’ or ‘algorithmic 
differentiation’ (Baydin et al., 2018). Unlike symbolic differentiation, the goal of 
automatic differentiation is not to find a mathematical expression for the derivatives 
but to have the computer automatically generate the code that implements the gra- 
dient calculations given only the code for the forward propagation equations. It 
is accurate to machine precision, just as with symbolic differentiation, but is more 
efficient because it is able to exploit intermediate variables used in the definition 
of the forward propagation equations and thereby avoid redundant evaluations. It 
is important to note that not only can automatic differentiation handle conventional 
closed-form mathematical expressions but it can also deal with flow control elements 
such as branches, loops, recursion, and procedure calls, and is therefore significantly 
more powerful than symbolic differentiation. Automatic differentiation is a well- 
established field with broad applicability that was developed largely outside of the 
machine learning community. Modern deep learning is a largely empirical process, 
involving evaluating and comparing different architectures, and automatic differenti- 
ation therefore plays a key role in enabling this experimentation to be done accurately 
and efficiently. 

The key idea of automatic differentiation is to take the code that evaluates a func- 
tion, for example the forward propagation equations that evaluate the error function 
for a neural network, and augment the code with additional variables whose values 
are accumulated during code execution to obtain the required derivatives. There are 
two principal forms of automatic differentiation, known as forward mode and re- 
verse mode. We start by looking at forward mode, which is conceptually somewhat 
simpler. 


8.2.1 Forward-mode automatic differentiation 


Figure 8.4 Evaluation trace diagram showing the steps involved in the numerical
evaluation of the function (8.49) using the primal equations (8.50) to (8.56).

In forward-mode automatic differentiation, we augment each intermediate variable
$v_i$, known as a ‘primal’ variable, involved in the evaluation of a function, such
as the error function of a neural network, with an additional variable representing
the value of some derivative of that variable, which we can denote $\dot{v}_i$, known as a
‘tangent’ variable. The tangent variables and their associated code are generated
automatically by the software environment. Instead of simply doing forward
propagation to compute $\{v_i\}$, the code now propagates tuples $(v_i, \dot{v}_i)$ so that variables
and derivatives are evaluated in parallel. The original function is generally defined 
in terms of elementary operators consisting of arithmetic operations and negation 
as well as transcendental functions such as exponential, logarithm, and trigonomet- 
ric functions, all of which have simple formulae for their derivatives. Using these 
derivatives in combination with the chain rule of calculus allows the code used to 
evaluate gradients to be constructed automatically. 

As an example, consider the following function, which has two input variables: 


$$f(x_1, x_2) = x_1 x_2 + \exp(x_1 x_2) - \sin(x_2). \tag{8.49}$$


When implemented in software, the code consists of a sequence of operations that 
can be expressed as an evaluation trace of the underlying elementary operations. 
This trace can be visualized in the form of a graph, as shown in Figure 8.4. Here we 
have defined the following primal variables 


$$v_1 = x_1 \tag{8.50}$$
$$v_2 = x_2 \tag{8.51}$$
$$v_3 = v_1 v_2 \tag{8.52}$$
$$v_4 = \sin(v_2) \tag{8.53}$$
$$v_5 = \exp(v_3) \tag{8.54}$$
$$v_6 = v_3 - v_4 \tag{8.55}$$
$$v_7 = v_5 + v_6. \tag{8.56}$$


Now suppose we wish to evaluate the derivative $\partial f/\partial x_1$. We define the tangent
variables by $\dot{v}_i = \partial v_i/\partial x_1$. Expressions for evaluating these can be constructed
automatically using the chain rule of calculus:

$$\dot{v}_i = \frac{\partial v_i}{\partial x_1} = \sum_{j \in \mathrm{pa}(i)} \frac{\partial v_j}{\partial x_1}\frac{\partial v_i}{\partial v_j} = \sum_{j \in \mathrm{pa}(i)} \dot{v}_j \frac{\partial v_i}{\partial v_j}, \tag{8.57}$$

where $\mathrm{pa}(i)$ denotes the set of parents of the node $i$ in the evaluation trace diagram,
that is the set of variables with arrows pointing to node $i$. For example, in Figure 8.4
the parents of node $v_3$ are nodes $v_1$ and $v_2$. Applying (8.57) to the evaluation trace
equations (8.50) to (8.56), we obtain the following evaluation trace equations for the
tangent variables:

$$\dot{v}_1 = 1 \tag{8.58}$$
$$\dot{v}_2 = 0 \tag{8.59}$$
$$\dot{v}_3 = v_1 \dot{v}_2 + \dot{v}_1 v_2 \tag{8.60}$$
$$\dot{v}_4 = \dot{v}_2 \cos(v_2) \tag{8.61}$$
$$\dot{v}_5 = \dot{v}_3 \exp(v_3) \tag{8.62}$$
$$\dot{v}_6 = \dot{v}_3 - \dot{v}_4 \tag{8.63}$$
$$\dot{v}_7 = \dot{v}_5 + \dot{v}_6. \tag{8.64}$$


We can summarize automatic differentiation for this example as follows. We 
first write code to implement the evaluation of the primal variables, given by (8.50) to 
(8.56). The associated equations and corresponding code for evaluating the tangent 
variables (8.58) to (8.64) are generated automatically. To evaluate the derivative
$\partial f/\partial x_1$, we input specific values of $x_1$ and $x_2$ and the code then executes the primal
and tangent equations, numerically evaluating the tuples $(v_i, \dot{v}_i)$ in sequence until
we obtain $\dot{v}_7$, which is the required derivative.
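One common way to realize this in software is with operator overloading on pairs of numbers. The sketch below is a minimal illustration rather than a description of any particular library: the class `Dual` and the helpers `d_sin` and `d_exp` are hypothetical names, and only the operations needed for the example function (8.49) are implemented.

```python
import math

class Dual:
    """A tuple (v, v_dot) holding a primal value and its tangent."""
    def __init__(self, v, dot=0.0):
        self.v, self.dot = v, dot

    def __add__(self, other):
        return Dual(self.v + other.v, self.dot + other.dot)

    def __sub__(self, other):
        return Dual(self.v - other.v, self.dot - other.dot)

    def __mul__(self, other):
        # Product rule applied to the tangent part.
        return Dual(self.v * other.v, self.v * other.dot + self.dot * other.v)

def d_sin(a):
    return Dual(math.sin(a.v), a.dot * math.cos(a.v))

def d_exp(a):
    e = math.exp(a.v)
    return Dual(e, a.dot * e)

def f(v1, v2):
    """Example function (8.49), written as the evaluation trace (8.50) to (8.56)."""
    v3 = v1 * v2
    v4 = d_sin(v2)
    v5 = d_exp(v3)
    v6 = v3 - v4
    v7 = v5 + v6
    return v7

# Seed the tangents as in (8.58) and (8.59) to differentiate with respect to x1.
out = f(Dual(1.0, 1.0), Dual(2.0, 0.0))
value, df_dx1 = out.v, out.dot
```

The user writes only the primal code in `f`. The tangent arithmetic of (8.60) to (8.64) is carried out by the overloaded operators as the function executes, so `df_dx1` holds the tangent of the output node.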

Now consider an example with two outputs $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$ where
$f_1(x_1, x_2)$ is defined by (8.49) and

$$f_2(x_1, x_2) = (x_1 x_2 - \sin(x_2))\exp(x_1 x_2) \tag{8.65}$$


as illustrated by the evaluation trace diagram in Figure 8.5. We see that this in- 
volves only a small extension to the evaluation equations for the primal and tangent 
variables, and so both $\partial f_1/\partial x_1$ and $\partial f_2/\partial x_1$ can be evaluated together in a single
forward pass. The downside, however, is that if we wish to evaluate derivatives with 
respect to a different input variable x2 then we have to run a separate forward pass. 
In general, if we have a function with D inputs and K outputs then a single pass 
of forward-mode automatic differentiation produces a single column of the K x D 
Jacobian matrix: 


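$$\mathbf{J} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_D} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_K}{\partial x_1} & \cdots & \dfrac{\partial f_K}{\partial x_D} \end{bmatrix}. \tag{8.66}$$

To compute column $j$ of the Jacobian, we need to initialize the forward pass of the
tangent equations by setting $\dot{x}_j = 1$ and $\dot{x}_i = 0$ for $i \neq j$. We can write this in vector
form as $\dot{\mathbf{x}} = \mathbf{e}_j$ where $\mathbf{e}_j$ is the $j$th unit vector. To compute the full Jacobian matrix
we need $D$ forward-mode passes. However, if we wish to evaluate the product of the
Jacobian with a vector $\mathbf{r} = (r_1, \ldots, r_D)^{\mathrm{T}}$:

$$\mathbf{J}\mathbf{r} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_D} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_K}{\partial x_1} & \cdots & \dfrac{\partial f_K}{\partial x_D} \end{bmatrix} \begin{bmatrix} r_1 \\ \vdots \\ r_D \end{bmatrix} \tag{8.67}$$

then this can be done in a single forward pass by setting $\dot{\mathbf{x}} = \mathbf{r}$.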
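The difference between building the Jacobian column by column and forming a Jacobian-vector product in one pass can be sketched as follows, for the two-output example given by (8.49) and (8.65). The `Dual` class and the names `forward_pass` and `jacobian` are hypothetical and introduced only for this illustration.

```python
import math

class Dual:
    """Pair (primal value, tangent value) propagated through the trace."""
    def __init__(self, v, dot):
        self.v, self.dot = v, dot

    def __add__(self, other):
        return Dual(self.v + other.v, self.dot + other.dot)

    def __sub__(self, other):
        return Dual(self.v - other.v, self.dot - other.dot)

    def __mul__(self, other):
        return Dual(self.v * other.v, self.v * other.dot + self.dot * other.v)

def d_sin(a):
    return Dual(math.sin(a.v), a.dot * math.cos(a.v))

def d_exp(a):
    e = math.exp(a.v)
    return Dual(e, a.dot * e)

def forward_pass(x, x_dot):
    """One forward-mode pass; returns the tangents of (f1, f2) for seed x_dot."""
    v1, v2 = Dual(x[0], x_dot[0]), Dual(x[1], x_dot[1])
    v3 = v1 * v2
    v4 = d_sin(v2)
    v5 = d_exp(v3)
    v6 = v3 - v4
    f1 = v5 + v6      # (8.49)
    f2 = v6 * v5      # (8.65), sharing the intermediate variables of f1
    return [f1.dot, f2.dot]

def jacobian(x):
    """Full K x D Jacobian from D passes, one per unit-vector seed."""
    D = len(x)
    cols = [forward_pass(x, [1.0 if i == j else 0.0 for i in range(D)])
            for j in range(D)]
    return [[cols[j][k] for j in range(D)] for k in range(len(cols[0]))]

x = [1.0, 2.0]
r = [0.5, -1.5]
J = jacobian(x)                        # D = 2 forward passes
Jr_single_pass = forward_pass(x, r)    # one forward pass seeded with r
Jr_from_J = [sum(J[k][j] * r[j] for j in range(2)) for k in range(2)]
```

Seeding with `r` works because the tangent equations are linear in the seed, so the single pass returns the same linear combination of Jacobian columns that `Jr_from_J` computes explicitly.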

We see that forward-mode automatic differentiation can evaluate the full $K \times D$
Jacobian matrix of derivatives using $D$ forward passes. This is very efficient for
networks with a few inputs and many outputs, such that $K \gg D$. However, we often
operate in a regime where we have just one function, namely the error function
used for training, and large numbers of variables that we want to differentiate with
respect to, comprising the weights and biases in the network, of which there may
be millions or billions. In such situations, forward-mode automatic differentiation is
extremely inefficient. We therefore turn to an alternative version of automatic
differentiation based on a backwards flow of derivative data through the evaluation
trace graph.


8.2.2 Reverse-mode automatic differentiation 


We can think of reverse-mode automatic differentiation as a generalization of 
the error backpropagation procedure. As with forward mode, we augment each
intermediate variable $v_i$ with additional variables, in this case called adjoint variables,
denoted $\bar{v}_i$. Consider again a situation with a single output function $f$ for which the
adjoint variables are defined by

$$\bar{v}_i = \frac{\partial f}{\partial v_i}. \tag{8.68}$$

These can be evaluated sequentially starting with the output and working backwards
by using the chain rule of calculus:

$$\bar{v}_i = \frac{\partial f}{\partial v_i} = \sum_{j \in \mathrm{ch}(i)} \frac{\partial f}{\partial v_j}\frac{\partial v_j}{\partial v_i} = \sum_{j \in \mathrm{ch}(i)} \bar{v}_j \frac{\partial v_j}{\partial v_i}. \tag{8.69}$$

Here $\mathrm{ch}(i)$ denotes the children of node $i$ in the evaluation trace graph, in other
words the set of nodes that have arrows pointing into them from node $i$. The
successive evaluation of the adjoint variables represents a flow of information backwards
through the graph, as we saw previously.

If we again consider the specific example function given by (8.50) to (8.56), we
obtain the following evaluation equations for the adjoint variables:

$$\bar{v}_7 = 1 \tag{8.70}$$
$$\bar{v}_6 = \bar{v}_7 \tag{8.71}$$
$$\bar{v}_5 = \bar{v}_7 \tag{8.72}$$
$$\bar{v}_4 = -\bar{v}_6 \tag{8.73}$$
$$\bar{v}_3 = \bar{v}_5 v_5 + \bar{v}_6 \tag{8.74}$$
$$\bar{v}_2 = \bar{v}_3 v_1 + \bar{v}_4 \cos(v_2) \tag{8.75}$$
$$\bar{v}_1 = \bar{v}_3 v_2. \tag{8.76}$$


Note that these start at the output and then flow backwards through the graph to the 
inputs. Even with multiple inputs, only a single backward pass is required to evaluate 
the derivatives. For a neural network error function, the derivatives of $E$ with respect
to the weights and biases are obtained as the corresponding adjoint variables.
However, if we now have more than one output then we need to run a separate backward
pass for each output variable. 
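For the example function (8.49), the forward and backward passes can be written out by hand as in the sketch below. This is a transcription of the kind of code a reverse-mode tool would generate, not the output of any particular tool. The function name `f_and_grad` and the `_bar` suffix for adjoint variables are naming choices made here.

```python
import math

def f_and_grad(x1, x2):
    """Value and full gradient of (8.49) from one forward and one backward pass."""
    # Forward pass: primal variables (8.50) to (8.56), kept for the backward pass.
    v1, v2 = x1, x2
    v3 = v1 * v2
    v4 = math.sin(v2)
    v5 = math.exp(v3)
    v6 = v3 - v4
    v7 = v5 + v6

    # Backward pass: adjoint variables (8.70) to (8.76), from output to inputs.
    v7_bar = 1.0
    v6_bar = v7_bar                              # v7 = v5 + v6
    v5_bar = v7_bar
    v4_bar = -v6_bar                             # v6 = v3 - v4
    v3_bar = v5_bar * v5 + v6_bar                # children of v3: v5 and v6
    v2_bar = v3_bar * v1 + v4_bar * math.cos(v2) # children of v2: v3 and v4
    v1_bar = v3_bar * v2                         # child of v1: v3
    return v7, (v1_bar, v2_bar)

value, (df_dx1, df_dx2) = f_and_grad(1.0, 2.0)
```

Both partial derivatives come out of the single backward pass, whereas forward mode would need one pass per input. The stored values `v1`, `v2` and `v5` that the backward pass reads are the source of the extra memory cost of reverse mode.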

Reverse mode is often more memory intensive than forward mode because all 
of the intermediate primal variables must be stored so that they will be available as 
needed when evaluating the adjoint variables during the backward pass. By contrast, 
with forward mode, the primal and tangent variables are computed together during 
the forward pass, and therefore variables can be discarded once they have been used. 
It is therefore also generally easier to implement forward mode compared to reverse 
mode. 

For both forward-mode and reverse-mode automatic differentiation, a single 
pass through the network is guaranteed to take no more than 6 times the compu- 
tational cost of a single function evaluation. In practice, the overhead is typically 
closer to a factor of 2 or 3 (Griewank and Walther, 2008). Hybrids of forward and 
reverse modes are also of interest. One situation in which this arises is in the eval- 
uation of the product of a Hessian matrix with a vector, which can be calculated 
without explicit evaluation of the full Hessian (Pearlmutter, 1994). Here we can use
reverse mode to calculate the gradient of code, which itself has been generated by
forward mode. We start from a vector $\mathbf{v}$ and a point $\mathbf{x}$ at which the Hessian-vector
product is to be evaluated. By setting $\dot{\mathbf{x}} = \mathbf{v}$ and using forward mode, we obtain
the directional derivative $\mathbf{v}^{\mathrm{T}}\nabla f$. This is then differentiated using reverse mode to
obtain $\nabla^2 f\,\mathbf{v} = \mathbf{H}\mathbf{v}$. If $W$ is the number of parameters in the neural network then
this evaluation has O(W) complexity even though the Hessian is of size $W \times W$.
The Hessian itself can also be evaluated explicitly using automatic differentiation
but this has $O(W^2)$ complexity.
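A Hessian-vector product can be sketched with the same ingredients. Note that this example uses the opposite nesting to the one described in the text: it applies forward mode (via a hypothetical `Dual` class) to a hand-written reverse-mode gradient of the example function (8.49). The two nestings yield the same product $\mathbf{H}\mathbf{v}$, and this one was chosen only because it is short to write without an autodiff library. All names are assumptions.

```python
import math

class Dual:
    """Pair (value, tangent) used to push a direction v through gradient code."""
    def __init__(self, v, dot=0.0):
        self.v, self.dot = v, dot

    def __add__(self, other):
        return Dual(self.v + other.v, self.dot + other.dot)

    def __neg__(self):
        return Dual(-self.v, -self.dot)

    def __mul__(self, other):
        return Dual(self.v * other.v, self.v * other.dot + self.dot * other.v)

def d_exp(a):
    e = math.exp(a.v)
    return Dual(e, a.dot * e)

def d_cos(a):
    return Dual(math.cos(a.v), -a.dot * math.sin(a.v))

def grad_f(x1, x2):
    """Reverse-mode gradient of (8.49), following (8.70) to (8.76), on Dual inputs."""
    v3 = x1 * x2
    v5 = d_exp(v3)
    v7_bar = Dual(1.0)
    v6_bar = v7_bar
    v5_bar = v7_bar
    v4_bar = -v6_bar
    v3_bar = v5_bar * v5 + v6_bar
    v2_bar = v3_bar * x1 + v4_bar * d_cos(x2)
    v1_bar = v3_bar * x2
    return v1_bar, v2_bar

def hessian_vector_product(x, v):
    """Seed the input tangents with v; the gradient's tangents are then H v."""
    g1, g2 = grad_f(Dual(x[0], v[0]), Dual(x[1], v[1]))
    return [g1.dot, g2.dot]

x = [1.0, 2.0]
v = [1.0, 0.5]
Hv = hessian_vector_product(x, v)
```

The cost is a small constant multiple of one gradient evaluation, and the $2 \times 2$ Hessian is never formed.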


Exercises

8.1 (⋆) By making use of (8.5), (8.6), (8.8), and (8.12), verify the backpropagation
formula (8.13) for evaluating the derivatives of an error function.

8.2 (⋆⋆) Consider a network that consists of layers and rewrite the backpropagation
formula (8.13) in matrix notation by starting with the forward propagation equation
(6.19). Note that the result involves multiplication by the transposes of the matrices.
8.3 (⋆) By using a Taylor expansion, verify that the terms that are $O(\epsilon)$ cancel on the
right-hand side of (8.25).

8.4 (⋆⋆) Consider a two-layer network of the form shown in Figure 6.9 with the addition
of extra parameters corresponding to skip-layer connections that go directly from the
inputs to the outputs. By extending the discussion of Section 8.1.3, write down the
equations for the derivatives of the error function with respect to these additional
parameters.


8.5 (⋆⋆⋆) In Section 8.1.5, we derived a procedure for evaluating the Jacobian matrix of
a neural network using a backpropagation procedure. Derive an alternative
formalism for finding the Jacobian based on forward propagation equations.


8.6 (⋆⋆⋆) Consider a two-layer neural network, and define the quantities

$$\delta_k = \frac{\partial E_n}{\partial a_k}, \qquad M_{kk'} = \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}. \tag{8.77}$$

Derive expressions for the elements of the Hessian matrix expressed in terms of $\delta_k$
and $M_{kk'}$ for elements in which (i) both weights are in the second layer, (ii) both
weights are in the first layer, and (iii) one weight is in each layer.

8.7 (⋆⋆⋆) Extend the results of Exercise 8.6 for the exact Hessian of a two-layer network
to include skip-layer connections that go directly from inputs to outputs.


8.8 (⋆⋆) The outer product approximation to the Hessian matrix for a neural network
using a sum-of-squares error function is given by (8.40). Extend this result for multiple
outputs.


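8.9 (⋆⋆) Consider a squared-loss function of the form

$$E(\mathbf{w}) = \frac{1}{2}\iint \{y(\mathbf{x}, \mathbf{w}) - t\}^2\, p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t \tag{8.78}$$

where $y(\mathbf{x}, \mathbf{w})$ is a parametric function such as a neural network. The result (4.37)
shows that the function $y(\mathbf{x}, \mathbf{w})$ that minimizes this error is given by the conditional
expectation of $t$ given $\mathbf{x}$. Use this result to show that the second derivative of $E$ with
respect to two elements $w_r$ and $w_s$ of the vector $\mathbf{w}$ is given by

$$\frac{\partial^2 E}{\partial w_r\,\partial w_s} = \int \frac{\partial y}{\partial w_r}\frac{\partial y}{\partial w_s}\, p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{8.79}$$

Note that, for a finite sample from $p(\mathbf{x})$, we obtain (8.40).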


8.10 (⋆⋆) Derive the expression (8.41) for the outer product approximation of a Hessian
matrix for a network having a single output with a logistic-sigmoid output-unit
activation function and a cross-entropy error function, corresponding to the result (8.40)
for the sum-of-squares error function.
8.11 (⋆⋆) Derive an expression for the outer product approximation of a Hessian matrix
for a network having $K$ outputs with a softmax output-unit activation function and
a cross-entropy error function, corresponding to the result (8.40) for the
sum-of-squares error function.


8.12 (⋆⋆⋆) Consider the matrix identity

$$\left(\mathbf{M} + \mathbf{v}\mathbf{v}^{\mathrm{T}}\right)^{-1} = \mathbf{M}^{-1} - \frac{\left(\mathbf{M}^{-1}\mathbf{v}\right)\left(\mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\right)}{1 + \mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\mathbf{v}}, \tag{8.80}$$

which is simply a special case of the Woodbury identity (A.7). By applying (8.80)
to the outer product approximation (8.40) for a Hessian matrix, derive a formula that
allows the inverse of the Hessian matrix to be computed by making one pass through
the training data and updating the inverse Hessian with each data point. Note that
the algorithm can be initialized using $\mathbf{H} = \alpha\mathbf{I}$ where $\alpha$ is a small constant, and that the
results are not particularly sensitive to the precise value of $\alpha$.


8.13 (⋆⋆) Verify that the derivative of (8.47) is given by (8.48).


8.14 (⋆⋆) The logistic map is a function defined by the iterative relation
$L_{n+1}(x) = 4L_n(x)(1 - L_n(x))$ with $L_1(x) = x$. Write down the evaluation trace equations for
$L_2(x)$, $L_3(x)$, and $L_4(x)$, and then write down expressions for the corresponding
derivatives $L_1'(x)$, $L_2'(x)$, $L_3'(x)$, and $L_4'(x)$. Do not simplify the expressions but
instead simply note how the complexity of the formulae for the derivatives grows
much more rapidly than the expressions for the functions themselves.


8.15 (⋆⋆) Starting from the evaluation trace equations (8.50) to (8.56) for the example
function (8.49), use (8.57) to derive the forward-mode tangent variable evaluation
equations (8.58) to (8.64).


8.16 (⋆⋆) Starting from the evaluation trace equations (8.50) to (8.56) for the example
function (8.49), and referring to Figure 8.4, use (8.69) to derive the reverse-mode
adjoint variable evaluation equations (8.70) to (8.76).


8.17 (⋆⋆⋆) Consider the example function (8.49). Write down an expression for $\partial f/\partial x_1$
and evaluate this function for $x_1 = 1$ and $x_2 = 2$. Now use the evaluation trace
equations (8.50) to (8.56) to evaluate the variables $v_1$ to $v_7$ and then use the evaluation
trace equations of forward-mode automatic differentiation to evaluate the tangent
variables $\dot{v}_1$ to $\dot{v}_7$ and to confirm that the resulting value of $\partial f/\partial x_1$ agrees with that
found directly. Similarly, use the evaluation trace equations of reverse-mode
automatic differentiation (8.70) to (8.76) to evaluate the adjoint variables $\bar{v}_7$ to $\bar{v}_1$ and
again confirm that the resulting value of $\partial f/\partial x_1$ agrees with that found directly.


8.18 (⋆⋆) By expressing an arbitrary vector $\mathbf{r} = (r_1, \ldots, r_D)^{\mathrm{T}}$ as a linear combination of
the unit vectors $\mathbf{e}_i$, where $i = 1, \ldots, D$, show that the product of the Jacobian of a
function with $\mathbf{r}$ in the form (8.67) can be evaluated using a single pass of
forward-mode automatic differentiation by setting $\dot{\mathbf{x}} = \mathbf{r}$.
Regularization


We introduced the concept of regularization when discussing polynomial curve fit- 
ting as a way to reduce over-fitting by discouraging the parameters of the model from 
taking values with a large magnitude. This involved adding a simple penalty term to 
the error function to give a regularized error function in the form 


$$\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \tag{9.1}$$

where $\mathbf{w}$ is the vector of model parameters, $E(\mathbf{w})$ is the unregularized error function,
and the regularization hyperparameter $\lambda$ controls the strength of the regularization
effect. An improvement in predictive accuracy with such a regularizer can be
understood in terms of the bias-variance trade-off through the reduction in the variance of
the solution at the expense of some increase in bias. In this chapter we will explore 
regularization in depth and will discuss several different approaches to
regularization. We will also look more broadly at the important role of bias in achieving good
generalization from finite training data sets.
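The effect of a penalty of the form (9.1) on the error and its gradient can be sketched in a few lines. The quadratic error `E`, its gradient `dE`, the wrapper `regularized`, and the step size are all hypothetical choices made for this illustration. The unregularized minimum is placed at $\mathbf{w} = (2, -1)$ so that the shrinkage towards zero is easy to see.

```python
def regularized(error, grad, lam):
    """Wrap E(w) and its gradient to give E(w) + (lam/2) w^T w and its gradient."""
    def error_reg(w):
        return error(w) + 0.5 * lam * sum(wi * wi for wi in w)

    def grad_reg(w):
        # The penalty contributes lam * w to the gradient.
        return [g + lam * wi for g, wi in zip(grad(w), w)]

    return error_reg, grad_reg

def E(w):
    """Hypothetical unregularized error with its minimum at w = (2, -1)."""
    return 0.5 * ((w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2)

def dE(w):
    return [w[0] - 2.0, w[1] + 1.0]

E_reg, dE_reg = regularized(E, dE, lam=1.0)

# Plain gradient descent on the regularized error.
w = [0.0, 0.0]
for _ in range(200):
    w = [wi - 0.1 * gi for wi, gi in zip(w, dE_reg(w))]
```

For this quadratic example the regularized minimum is the unregularized one scaled by $1/(1 + \lambda)$, so with $\lambda = 1$ the weights converge to $(1, -0.5)$, which has a smaller magnitude than $(2, -1)$.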

In a practical application, it is very unlikely that the process that generates the 
data will correspond precisely to a particular neural network architecture, and so 
any given neural network will only ever represent an approximation to the true data 
generator. Larger networks can provide closer approximations, but this comes at the 
risk of over-fitting. In practice, we find that the best generalization results are almost 
always obtained by using a larger network combined with some form of regular- 
ization. In this chapter we explore several alternative regularization techniques in- 
cluding early stopping, model averaging, dropout, data augmentation, and parameter 
sharing. Multiple forms of regularization can be used together if desired. For exam- 
ple, error function regularization of the form (9.1) is often used alongside dropout. 


Inductive Bias 


When we compared the predictive error of polynomials of various orders for the si- 
nusoidal synthetic data problem, we saw that the smallest generalization error was 
achieved using a polynomial of intermediate complexity, being neither too simple 
nor too flexible. A similar result was found when we used a regularization term of 
the form (9.1) to control model complexity, as an intermediate value of the
regularization coefficient $\lambda$ gave the best predictions for new input values. Insight into this
result came from the bias-variance decomposition, where we saw that an appropriate
level of bias in the model was important to allow generalization from finite data sets. 
Simple models with high bias are unable to capture the variation in the underlying 
data generation process, whereas highly flexible models with low bias are prone to 
over-fitting leading to poor generalization. As the size of the data set grows, we can 
afford to use more flexible models having less bias without incurring excessive vari- 
ance, thereby leading to improved generalization. Note that in a practical setting, our 
choice of model might also be influenced by factors such as memory usage or speed 
of execution. Here we ignore such ancillary considerations and focus on the core 
goal of achieving good predictive performance, in other words good generalization. 


9.1.1 Inverse problems 


This issue of model choice lies at the heart of machine learning and can be 
traced to the fact that most machine learning tasks are examples of inverse
problems. Given a conditional distribution $p(t|\mathbf{x})$ along with a finite set of input points
$\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, it is straightforward, at least in principle, to sample corresponding
values $\{t_1, \ldots, t_N\}$ from that distribution. In machine learning, however, we have
to solve the inverse of this problem, namely to infer an entire distribution given only a 
finite number of samples. This is intrinsically ill-posed, as there are infinitely many 
distributions which could potentially have been responsible for generating the ob- 
served training data. In fact any distribution that has a non-zero probability density 
at the observed target values is a candidate. 

For machine learning to be useful, however, we need to make predictions for
new values of x, and therefore we need a way to choose a specific distribution from 
amongst the infinitely many possibilities. The preference for one choice over oth- 
ers is called inductive bias, or prior knowledge, and plays a central role in machine 
learning. Prior knowledge may come from background information that helps con- 
strain the space of solutions. For many applications, small changes in the input 
values should lead to small changes in the output values, and so we should bias our 
solutions towards those with smoothly varying functions. Regularization terms of 
the form (9.1) encourage the model weights to have a smaller magnitude and hence 
introduce a bias towards functions that vary more slowly with changes in the inputs. 
Likewise, when detecting objects in images, we can introduce prior knowledge that 
the identity of an object is generally independent of its location within the image. 
This is known as translation invariance, and incorporating this into our solution can 
greatly simplify the task of building a system with good generalization. However, 
care must be taken not to incorporate biases or constraints that are inconsistent with 
the underlying process that generates the data. For example, assuming that the re- 
lationship between outputs and inputs is linear when in fact there are significant 
nonlinearities can lead to a system that produces inaccurate answers. 

Techniques such as transfer learning and multi-task learning can also be viewed 
from the perspective of regularization. When training data for a particular task is 
limited, additional data from a different, but related, task can be used to help de- 
termine the learnable parameters in a neural network. The assumption of similarity 
between the tasks represents a more sophisticated form of inductive bias compared 
to simple regularization, and this explains the improved performance resulting from 
the use of the additional data. 


9.1.2 No free lunch theorem 


The core focus of this book is on the important class of machine learning mod- 
els called deep neural networks. These are highly flexible models and have revo- 
lutionized many fields including computer vision, speech recognition, and natural 
language processing. In fact, they have become the framework of choice for the 
great majority of machine learning applications. It might appear, therefore, that they 
represent a ‘universal’ learning algorithm able to solve all tasks. However, even very 
flexible neural networks contain important inductive biases. For example, convolu- 
tional neural networks encode specific forms of inductive bias, including translation 
equivariance, that are especially useful in applications involving images. 

The no free lunch theorem (Wolpert, 1996), named from the expression “There’s 
no such thing as a free lunch,” states that every learning algorithm is as good as any
other when averaged over all possible problems. If a particular model or algorithm 
is better than average on some problems, it must be worse than average on others. 
However, this is a rather theoretical notion as the space of possible problems here 
includes relationships between input and output that would be very uncharacteristic 
of any plausible practical application. For example, we have already noted that most 
examples of practical interest exhibit some degree of smoothness, in which small 
changes in the input values are associated, for the most part, with small changes in
the target values. Models such as neural networks, and indeed most widely used 
machine learning techniques, exhibit this form of inductive bias, and therefore to 
some degree, they have broad applicability. 

Although the no free lunch theorem is somewhat theoretical, it does highlight 
the central importance of bias in determining the performance of a machine learning 
algorithm. It is not possible to learn ‘purely from data’ in the absence of any bias. In
practice, bias may be implicit. For example, every neural network has a finite number 
of parameters, which therefore limits the functions that it can represent. However, 
bias may also be encoded explicitly as a reflection of prior knowledge relating to the 
specific type of problem being solved. 

In trying to find general-purpose learning algorithms, we are really seeking in- 
ductive biases that are appropriate to the broad classes of applications that will be 
encountered in practice. For any given application, however, better results can be ob- 
tained if it is possible to incorporate stronger inductive biases that are specific to that 
application. The perspective of model-based machine learning (Winn et al., 2023) 
advocates making all the assumptions in machine learning models explicit so that 
the appropriate choices can be made for inductive biases. 

We have seen that inductive bias can be incorporated through the form of dis- 
tribution, for example by specifying that the output is a linear function of a fixed 
set of specific basis functions. It can also be incorporated through the addition of 
a regularization term to the error function used during training. Yet another way to 
control the complexity of a neural network is through the training process itself. We 
will see that deep neural networks can give good generalization even when the num- 
ber of adjustable parameters exceeds the number of training data points, provided 
the training process is set up correctly. Part of the skill in applying deep learning to 
real-world problems is in the careful design of inductive bias and the incorporation 
of prior knowledge. 


9.1.3 Symmetry and invariance 


In many applications of machine learning, the predictions should be unchanged, 
or invariant, under one or more transformations of the input variables. For example, 
when classifying an object in two-dimensional images, such as a cat or dog, a par- 
ticular object should be assigned the same classification irrespective of its position 
within the image. This is known as translation invariance. Likewise changes to the 
size of the object within the image should also leave its classification unchanged. 
This is called scale invariance. Exploiting such symmetries to create inductive bi- 
ases can greatly improve the performance of machine learning models and forms the 
subject of geometric deep learning (Bronstein et al., 2021). 

Transformations, such as a translation or scaling, that leave particular properties 
unchanged, represent symmetries. The set of all transformations corresponding to a 
particular symmetry form a mathematical structure called a group. A group consists 
of a set of elements $A, B, C, \ldots$ together with a binary operation for composing pairs
of elements together, which we denote using the notation $A \circ B$. The following four
axioms hold for a group:

1. Closed: For any two elements $A$, $B$ in the set, $A \circ B$ must also be in the set.

2. Associative: For any three elements $A$, $B$, $C$ in the set, $(A \circ B) \circ C = A \circ (B \circ C)$.

3. Identity: There exists an element $I$ of the set, called the identity, with the
property $A \circ I = I \circ A = A$ for every element $A$ in the set.

4. Inverse: For each element $A$ in the set, there exists another element in the
set, which we denote by $A^{-1}$, called the inverse, which has the property
$A \circ A^{-1} = A^{-1} \circ A = I$.


Simple examples of groups include the set of rotations of a square through multiples 
of 90° or the set of continuous translations of an object in a two-dimensional plane. 
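
As a concrete check, the four axioms can be verified numerically for the rotations of a square through multiples of 90°. The sketch below represents each rotation as a 2×2 integer matrix and uses matrix multiplication as the composition operation; this representation is a choice made here for illustration.

```python
import itertools
import numpy as np

# Rotation by 90 degrees as an integer matrix; its powers 0..3 give the group elements.
R = np.array([[0, -1], [1, 0]])
elements = [np.linalg.matrix_power(R, k) for k in range(4)]
identity = elements[0]

def in_set(M):
    """True if the matrix M is one of the four rotations."""
    return any(np.array_equal(M, E) for E in elements)

closed = all(in_set(A @ B) for A, B in itertools.product(elements, repeat=2))
associative = all(
    np.array_equal((A @ B) @ C, A @ (B @ C))
    for A, B, C in itertools.product(elements, repeat=3)
)
has_identity = all(
    np.array_equal(A @ identity, A) and np.array_equal(identity @ A, A)
    for A in elements
)
has_inverses = all(
    any(np.array_equal(A @ B, identity) and np.array_equal(B @ A, identity)
        for B in elements)
    for A in elements
)
print(closed, associative, has_identity, has_inverses)
```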

In principle, invariance of the predictions made by a neural network to trans- 
formations of the input space could be learned from data, without any special mod- 
ifications to the network or the training procedure. In practice, however, this can 
prove to be extremely challenging because such transformations can produce sub- 
stantial changes in the raw data. For example, relatively small translations of an 
object within an image, even by a few pixels, can cause pixel values to change sig- 
nificantly. Furthermore, multiple invariances must often hold at the same time, for 
example invariance to translations in two dimensions as well as scaling, rotation, 
changes of intensity, changes of colour balance, and many others. There are expo- 
nentially many possible combinations of such transformations, making the size of 
the required training set needed to learn all of these invariances prohibitive. 

We therefore seek more efficient approaches for encouraging an adaptive model 
to exhibit the required invariances. These can broadly be divided into four categories: 


1. Pre-processing. Invariances are built into a pre-processing stage by computing 
features of the data that are invariant under the required transformations. Any 
subsequent regression or classification system that uses such features as inputs 
will necessarily also respect these invariances. 


2. Regularized error function. A regularization term is added to the error func- 
tion to penalize changes in the model output when the input is subject to one 
of the invariant transformations. 


3. Data augmentation. The training set is expanded using replicas of the training 
data points, transformed according to the desired invariances and carrying the 
same output target values as the untransformed examples. 


4. Network architecture. The invariance properties are built into the structure 
of a neural network through an appropriate choice of network architecture. 


One challenge with approach 1 is to design features that exhibit the required 
invariances without also discarding information that can be useful for determining 
the network outputs. We have already seen that fixed, hand-crafted features have 
limited capabilities and have been superseded by learned representations obtained 
using deep neural networks. 

Figure 9.1 Illustration of data set augmentation, showing (a) the original image, (b) horizontal inversion, (c) scaling, (d) translation, (e) rotation, (f) brightness and contrast change, (g) additive noise, and (h) colour shift.


An example of approach 2 is the technique of tangent propagation (Simard et al.,
1992) in which a regularization term is added to the error function during training.
This term directly penalizes changes in the output resulting from changes in the input
variables that correspond to one of the invariant transformations. A limitation of this
technique, in addition to the extra complexity of training, is that it can only cope with small
transformations (e.g., translations by less than a pixel).

Approach 3 is known as data set augmentation. It is often relatively easy to 
implement and can prove to be very effective in practice. It is often applied in the 
context of image analysis as it is straightforward to create the transformed training data.
Figure 9.1 shows examples of such transformations applied to an image of a cat. 
For medical images of soft tissue, data augmentation could also include continuous 
‘rubber sheet’ deformations (Ronneberger, Fischer, and Brox, 2015). 

For sequential training algorithms, such as stochastic gradient descent, the data 
set can be augmented by transforming each input data point before it is presented 
to the model so that, if the data points are being recycled, a different transformation 
(drawn from an appropriate distribution) is applied each time. For batch methods, a 
similar effect can be achieved by replicating each data point a number of times and 
transforming each copy independently. 
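
For the sequential case, one way to apply a fresh transformation each time a data point is recycled is sketched below. The specific transformations (a random horizontal flip and a small wrap-around translation), the function names, and the toy 8×8 images are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform(image, max_shift=2):
    """Apply a random horizontal flip and a small cyclic translation."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                      # horizontal inversion
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(dy, dx), axis=(0, 1))  # wrap-around translation

def augmented_epochs(images, labels, num_epochs):
    """Yield (image, label) pairs; every pass draws new transformations."""
    for _ in range(num_epochs):
        for n in rng.permutation(len(images)):
            # The target is carried over unchanged from the original example.
            yield random_transform(images[n]), labels[n]

images = rng.random((4, 8, 8))      # four toy greyscale images
labels = np.array([0, 1, 0, 1])
stream = list(augmented_epochs(images, labels, num_epochs=3))
print(len(stream))
```

In a real training loop each yielded pair would be passed to a stochastic gradient descent update rather than collected into a list.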

We can analyse the effect of using augmented data by considering transforma- 
tions that represent small changes to the original examples and then making a Taylor 
expansion of the error function in powers of the magnitude of the transformation 
(Bishop, 1995c; Leen, 1995; Bishop, 2006). This leads to a regularized error func- 
tion in which the regularizer penalizes the gradient of the network output with respect 
to the input variables projected onto the direction of transformation. This is related
to the technique of tangent propagation discussed above. A special case arises when 
the transformation of the input variables consists simply of the addition of random 
noise, in which case the regularizer penalizes the derivatives of the network outputs 
with respect to the inputs. Again, this is intuitively reasonable, since we are encour- 
aging the network outputs to remain unchanged despite the addition of noise to the 
input variables. 
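
The additive-noise case can be worked through exactly in a simple setting. The derivation below is a sketch under stated assumptions that are not part of the text: a linear model $y(x) = w^{T}x$, a squared error for a single data point, and input noise $\epsilon$ with zero mean and covariance $\sigma^2 I$.

```latex
% Assumptions: y(x) = w^T x, E[eps] = 0, E[eps eps^T] = sigma^2 I.
\begin{align*}
\mathbb{E}_{\epsilon}\bigl[(y(x + \epsilon) - t)^2\bigr]
  &= \mathbb{E}_{\epsilon}\bigl[(w^{T}x - t + w^{T}\epsilon)^2\bigr] \\
  &= (w^{T}x - t)^2
     + 2\,(w^{T}x - t)\,w^{T}\mathbb{E}[\epsilon]
     + w^{T}\mathbb{E}[\epsilon\epsilon^{T}]\,w \\
  &= (w^{T}x - t)^2 + \sigma^2 \lVert w \rVert^2 .
\end{align*}
```

Since $\nabla_x y = w$ for this model, the extra term is $\sigma^2 \lVert \nabla_x y \rVert^2$, a penalty on the derivative of the output with respect to the inputs. For a nonlinear network the same form appears only as the leading term of the Taylor expansion.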

Finally, approach 4, in which we build invariances into the structure of the net- 
work, has proven to be very powerful and effective and leads to other key benefits. 
We will discuss this approach at length in the context of convolutional neural net- 
works for computer vision. 


9.1.4 Equivariance 


An important generalization of invariance is called equivariance in which the 
output of the network, instead of remaining constant when the input is transformed, 
is itself transformed in a specific way. For example, consider a network that takes 
an image as input and returns a segmentation of that image in which each pixel 
is classified as belonging either to a foreground object or to the background. In 
this case, if the location of the object within the image is translated, we want the 
corresponding segmentation of the object to be similarly translated. Suppose we
denote the image by $I$ and the operation of the segmentation network by $S$; then for
a translation operation $T$ we have

$$S(T(I)) = T(S(I)), \tag{9.2}$$

which says that the segmentation of the translated image is given by the translation
of the segmentation of the original image. This is illustrated in Figure 9.2.

More generally, equivariance can hold if the transformation applied to the output
is different to that applied to the input:

$$S(T(I)) = \widetilde{T}(S(I)). \tag{9.3}$$

For example, if the segmented image has a lower resolution than the original image,
then if $T$ is a translation in the original image space, $\widetilde{T}$ represents the corresponding
translation in the lower-dimensional segmentation space. Similarly, if $S$ is an operator
that measures the orientation of an object within an image, and $T$ represents a
rotation (which is a complex nonlinear transformation of all of the pixel values in the
image), then $\widetilde{T}$ will increment or decrement the scalar orientation value generated by
$S$.

We also see that invariance is a special case of equivariance in which the output
transformation is simply the identity. For example, if $C$ is a neural network that
classifies an object within an image and $T$ is a translation operator, then

$$C(T(I)) = C(I), \tag{9.4}$$

which says that the class of the object does not depend on its position within the
image.
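
The difference between equivariance and invariance can be checked numerically on a toy example. In the sketch below the 'segmentation network' is a pixel-wise threshold and the 'classifier' tests whether any pixel exceeds the threshold; both are stand-ins chosen for illustration, and the translation is cyclic so that it is well defined on a finite grid.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 6))

def translate(a, shift=(1, 2)):
    """Cyclic translation T, applicable to images and segmentation maps alike."""
    return np.roll(a, shift=shift, axis=(0, 1))

def segment(img, threshold=0.5):
    """Toy segmentation S: mark each pixel as foreground (True) or background."""
    return img > threshold

def classify(img, threshold=0.5):
    """Toy classifier C: does the image contain any foreground pixel?"""
    return bool(np.any(img > threshold))

# Equivariance: segmenting the translated image equals translating the segmentation.
equivariant = np.array_equal(segment(translate(image)), translate(segment(image)))
# Invariance: the class does not depend on the position of the object.
invariant = classify(translate(image)) == classify(image)
print(equivariant, invariant)
```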

Figure 9.2 Illustration of equivariance, corresponding to (9.2). If an image (a) is first translated to give (b) and then segmented to give (d), the result is the same as if the image is first segmented to give (c) and then translated to give (d).


9.2. Weight Decay


We introduced regularization in the context of linear regression to control model 
complexity, as an alternative to limiting the number of parameters in the model. The 
simplest regularizer comprises the sum of the squares of the model parameters to 
give a regularized error function of the form (9.1), which penalizes parameter values 
with large magnitude. The effective model complexity is then determined by the 
choice of the regularization coefficient $\lambda$.

We have also seen that this additive regularization term can be interpreted as 
the negative logarithm of a zero-mean Gaussian prior distribution over the weight 
vector w. This provides a probabilistic perspective on the inclusion of prior knowl- 
edge into the model training process. Unfortunately, this prior is expressed over the 
model parameters, whereas any domain knowledge we might possess regarding the 
problem to be solved is more likely to be expressed in terms of the network function 
from inputs to outputs. The relationship between the parameters and the network 
function is, however, extremely complex, and therefore only very limited kinds of 
prior knowledge can easily be expressed directly as priors over model parameters. 

The sum-of-squares regularizer in (9.1) is known in the machine learning litera- 
ture as weight decay because in sequential learning algorithms, it encourages weight 
values to decay towards zero, unless supported by the data. One advantage of this 
kind of regularizer is that it is trivial to evaluate its derivatives for use in gradient 
descent training. Specifically for (9.1) the gradient is given by 


$$\nabla \widetilde{E}(\mathbf{w}) = \nabla E(\mathbf{w}) + \lambda \mathbf{w}. \tag{9.5}$$

Figure 9.3 Contours of the error function (red), the regularization term (green), and a linear combination of
the two (blue) for a quadratic error function and a sum-of-squares regularizer $\lambda(w_1^2 + w_2^2)$. Here the axes in
parameter space have been rotated to align with the axes of the elliptical contours of the unregularized error
function. For $\lambda = 0$, the minimum error is indicated by $\mathbf{w}^*$. When $\lambda > 0$, the minimum of the regularized error
function $E(\mathbf{w}) + \lambda(w_1^2 + w_2^2)$ is shifted towards the origin. This shift is greater in the direction of $w_1$ because
the unregularized error is relatively insensitive to the parameter value, and less in direction $w_2$ where the error is
more strongly dependent on the parameter value. The regularization term is effectively suppressing parameters
that have only a small effect on the accuracy of the network predictions.


Section 4.1.2 
Section 8.1.6 


We see that the factor of 1/2 in (9.1), which is often included by convention, disap- 
pears when we take the derivative. 
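
The name 'weight decay' can be seen directly from a single gradient-descent step that uses the gradient (9.5). In the sketch below the function name and the numerical values are illustrative assumptions; the data gradient is set to zero to isolate the effect of the regularizer.

```python
import numpy as np

def weight_decay_step(w, grad_E, lr, lam):
    """One gradient-descent step on E(w) + (lam / 2) * ||w||^2, using (9.5)."""
    return w - lr * (grad_E + lam * w)

w = np.array([1.0, -2.0, 0.5])
lr, lam = 0.1, 0.5

# With no support from the data (zero data gradient), every weight is
# multiplied by (1 - lr * lam) and therefore decays towards zero.
w_next = weight_decay_step(w, grad_E=np.zeros_like(w), lr=lr, lam=lam)
print(w_next)
```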

The general effect of a quadratic regularizer can be seen by considering a two- 
dimensional parameter space along with an unregularized error function E(w) that 
is a quadratic function of w (corresponding to a simple linear regression model with 
a sum-of-squares error function), as illustrated in Figure 9.3. The axes in param- 
eter space have been rotated to align with the eigenvectors of the Hessian matrix, 
corresponding to the axes of the elliptical error function contours. We see that the 
effect of the regularization term is to shrink the magnitudes of the weight parameters. 
However, the effect is much larger for parameter $w_1$ because the unregularized error
is much less sensitive to the value of $w_1$ compared to that of $w_2$. Intuitively only
the parameter $w_2$ is ‘active’ because the output is relatively insensitive to $w_1$, and
hence the regularizer pushes $w_1$ close to zero. The effective number of parameters
is the number that remain active after regularization, and this concept can be formalized
either from a Bayesian or from a frequentist perspective (Bishop, 2006; Hastie,
Tibshirani, and Friedman, 2009). For $\lambda \to \infty$, all the parameters are driven to zero
and the effective number of parameters is then zero. As $\lambda$ is reduced, the effective number
of parameters increases until for $\lambda = 0$ it equals the total number of actual parameters
in the model. We see that controlling model complexity by regularization has
similarities to controlling model complexity by limiting the number of parameters. 
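
For a quadratic error function this behaviour can be made explicit. In the eigenbasis of the Hessian, with eigenvalues $h_j$ and unregularized minimum $w_j^*$, the regularized minimum is $w_j = h_j w_j^* / (h_j + \lambda)$, and the sum of the factors $h_j / (h_j + \lambda)$ is one standard way to define the effective number of parameters. The sketch below uses assumed eigenvalues (0.01 and 10) purely for illustration.

```python
import numpy as np

def shrinkage_factors(hessian_eigenvalues, lam):
    """Per-direction factors h_j / (h_j + lam) for a quadratic error function."""
    return hessian_eigenvalues / (hessian_eigenvalues + lam)

h = np.array([0.01, 10.0])        # small curvature along w_1, large along w_2
w_star = np.array([1.0, 1.0])     # minimum of the unregularized error

for lam in [0.0, 1.0, 1e6]:
    s = shrinkage_factors(h, lam)
    print(f"lambda = {lam:g}: w = {s * w_star}, effective parameters = {s.sum():.3f}")
```

With an intermediate value of $\lambda$ the insensitive direction $w_1$ is suppressed almost completely while $w_2$ is only slightly reduced.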


9.2.1 Consistent regularizers 


One of the limitations of simple weight decay in the form (9.1) is that it breaks 
certain desirable transformation properties of network mappings. To illustrate this, 
consider a multilayer perceptron network having two layers of weights and linear 
output units that performs a mapping from a set of input variables $\{x_i\}$ to a set of
output variables $\{y_k\}$. The activations of the hidden units in the first hidden layer
take the form

$$z_j = h\Bigl(\sum_i w_{ji} x_i + w_{j0}\Bigr), \tag{9.6}$$

whereas the activations of the output units are given by

$$y_k = \sum_j w_{kj} z_j + w_{k0}. \tag{9.7}$$

Suppose we perform a linear transformation of the input data:

$$x_i \to \widetilde{x}_i = a x_i + b. \tag{9.8}$$


Then we can arrange for the mapping performed by the network to be unchanged 
by making a corresponding linear transformation of the weights and biases from the 
inputs to the units in the hidden layer: 


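$$w_{ji} \to \widetilde{w}_{ji} = \frac{1}{a} w_{ji} \tag{9.9}$$

$$w_{j0} \to \widetilde{w}_{j0} = w_{j0} - \frac{b}{a} \sum_i w_{ji}. \tag{9.10}$$

Similarly, a linear transformation of the output variables of the network of the form

$$y_k \to \widetilde{y}_k = c y_k + d \tag{9.11}$$

can be achieved by transforming the second-layer weights and biases using

$$w_{kj} \to \widetilde{w}_{kj} = c w_{kj} \tag{9.12}$$

$$w_{k0} \to \widetilde{w}_{k0} = c w_{k0} + d. \tag{9.13}$$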


If we train one network using the original data and one network using data for which 
the input and/or target variables have been transformed by one of the above linear 
transformations, then consistency requires that we should obtain equivalent networks 
that differ only by the linear transformation of the weights as given. Any regularizer 
should be consistent with this property, otherwise it would arbitrarily favour one 
solution over another, equivalent one. Clearly, simple weight decay (9.1), which 
treats all weights and biases on an equal footing, does not satisfy this property. 
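
The transformation properties (9.8) to (9.13) are easy to confirm numerically. The following sketch uses a small random network with tanh hidden units; the layer sizes and the values of $a$, $b$, $c$, and $d$ are arbitrary choices made for illustration. It also evaluates a simple sum-of-squares penalty over all weights and biases to show that its value differs between the two equivalent networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 5, 2                                    # inputs, hidden units, outputs
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)   # w_ji and w_j0
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)   # w_kj and w_k0

def network(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)      # (9.6) with h = tanh
    return W2 @ z + b2            # (9.7)

a, b, c, d = 2.0, -0.7, 3.0, 0.4
x = rng.normal(size=D)
x_t = a * x + b                                   # (9.8)
W1_t = W1 / a                                     # (9.9)
b1_t = b1 - (b / a) * W1.sum(axis=1)              # (9.10)
W2_t, b2_t = c * W2, c * b2 + d                   # (9.12), (9.13)

y = network(x, W1, b1, W2, b2)
y_t = network(x_t, W1_t, b1_t, W2_t, b2_t)        # should equal c * y + d, as in (9.11)

def simple_decay(*params):
    """Sum-of-squares penalty over all weights and biases."""
    return 0.5 * sum(np.sum(p ** 2) for p in params)

decay_original = simple_decay(W1, b1, W2, b2)
decay_transformed = simple_decay(W1_t, b1_t, W2_t, b2_t)
print(np.allclose(y_t, c * y + d), decay_original, decay_transformed)
```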

We therefore look for a regularizer that is invariant under the linear transformations
(9.9), (9.10), (9.12), and (9.13). These require that the regularizer should be
invariant to re-scaling of the weights and to shifts of the biases. Such a regularizer is
given by

$$\frac{\lambda_1}{2} \sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in \mathcal{W}_2} w^2 \tag{9.14}$$

where $\mathcal{W}_1$ denotes the set of weights in the first layer, $\mathcal{W}_2$ denotes the set of weights
in the second layer, and biases are excluded from the summations. This regularizer
will remain unchanged under the weight transformations provided the regularization
parameters are re-scaled using $\lambda_1 \to a^{1/2} \lambda_1$ and $\lambda_2 \to c^{-1/2} \lambda_2$.

The regularizer (9.14) corresponds to a prior distribution over the parameters of 
the form: 


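$$p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\Bigl(-\frac{\alpha_1}{2} \sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in \mathcal{W}_2} w^2\Bigr). \tag{9.15}$$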


Note that priors of this form are improper (they cannot be normalized) because the 
bias parameters are unconstrained. Using improper priors can lead to difficulties in 
selecting regularization coefficients and in model comparison within the Bayesian 
framework. It is therefore common to include separate priors for the biases (which 
then break the shift invariance) that have their own hyperparameters. 

We can illustrate the effect of the resulting four hyperparameters by drawing 
samples from the prior and plotting the corresponding network functions, as shown 
in Figure 9.4. The priors are governed by four hyperparameters, $\alpha_1^b$, $\alpha_1^w$, $\alpha_2^b$, and $\alpha_2^w$,
which represent the precisions of the Gaussian distributions of the first-layer biases,
first-layer weights, second-layer biases, and second-layer weights, respectively. We
see that the parameter $\alpha_2^w$ governs the vertical scale of the functions (note the different
vertical axis ranges on the top two diagrams), $\alpha_1^w$ governs the horizontal scale
of variations in the function values, and $\alpha_1^b$ governs the horizontal range over which
variations occur. The parameter $\alpha_2^b$, whose effect is not illustrated here, governs the
range of the vertical offsets of the functions.
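
Drawing such samples takes only a few lines. The sketch below follows the description in the text literally, treating each hyperparameter as a precision so that the corresponding standard deviation is $\alpha^{-1/2}$; the function name, the number of samples, and the hyperparameter values used in the call are assumptions for illustration and are not the settings of Figure 9.4.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_network_functions(x, alpha_w1, alpha_b1, alpha_w2, alpha_b2,
                             num_hidden=12, num_samples=5):
    """Draw functions y(x) from zero-mean Gaussian priors over a two-layer network.

    Each alpha is interpreted as a precision (inverse variance).
    """
    samples = []
    for _ in range(num_samples):
        w1 = rng.normal(0.0, alpha_w1 ** -0.5, size=num_hidden)  # first-layer weights
        b1 = rng.normal(0.0, alpha_b1 ** -0.5, size=num_hidden)  # first-layer biases
        w2 = rng.normal(0.0, alpha_w2 ** -0.5, size=num_hidden)  # second-layer weights
        b2 = rng.normal(0.0, alpha_b2 ** -0.5)                   # second-layer bias
        z = np.tanh(np.outer(x, w1) + b1)     # hidden activations, shape (len(x), num_hidden)
        samples.append(z @ w2 + b2)           # single linear output
    return np.array(samples)

x = np.linspace(-1.0, 1.0, 201)
ys = sample_network_functions(x, alpha_w1=1.0, alpha_b1=1.0, alpha_w2=1.0, alpha_b2=1.0)
print(ys.shape)
```

Plotting each row of `ys` against `x` for different hyperparameter settings gives a feel for the kinds of function that the prior favours.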

More generally, we can consider regularizers in which the weights are divided 
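into any number of groups $\mathcal{W}_k$ so that

$$\Omega(\mathbf{w}) = \frac{1}{2} \sum_k \alpha_k \lVert \mathbf{w} \rVert_k^2 \tag{9.16}$$

where

$$\lVert \mathbf{w} \rVert_k^2 = \sum_{j \in \mathcal{W}_k} w_j^2. \tag{9.17}$$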


For example, we could use a different regularizer for each layer in a multilayer net- 
work. 
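
A grouped regularizer of the form (9.16) is straightforward to evaluate. In this minimal sketch each group is an array of weights with its own coefficient; the function name, the weight values, and the coefficients are made up for illustration.

```python
import numpy as np

def grouped_regularizer(groups, alphas):
    """Omega(w) = 0.5 * sum_k alpha_k * ||w||_k^2, with one coefficient per group."""
    return 0.5 * sum(alpha * np.sum(w ** 2) for w, alpha in zip(groups, alphas))

first_layer = np.array([[1.0, -2.0], [0.5, 0.0]])   # squared norm 5.25
second_layer = np.array([3.0, -1.0])                # squared norm 10.0
omega = grouped_regularizer([first_layer, second_layer], alphas=[0.1, 0.01])
print(omega)  # 0.5 * (0.1 * 5.25 + 0.01 * 10.0) = 0.3125
```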


Figure 9.4 Illustration of the effect of the hyperparameters governing the prior distribution over weights and
biases in a two-layer network having a single input, a single linear output, and 12 hidden units with tanh activation
functions. Each panel plots sampled network functions over the input range $-1$ to $1$. Two of the panels are labelled
$\alpha_1^w = 1000$, $\alpha_1^b = 100$, $\alpha_2^w = 1$, $\alpha_2^b = 1$ and $\alpha_1^w = 1000$, $\alpha_1^b = 1000$, $\alpha_2^w = 1$, $\alpha_2^b = 1$.


9.2.2 Generalized weight decay 


A generalization of the simple quadratic regularizer is sometimes used: 
$$\Omega(\mathbf{w}) = \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q \tag{9.18}$$


where q = 2 corresponds to the quadratic regularizer in (9.1). Figure 9.5 shows 
contours of the regularization function for different values of q. 

Figure 9.5 Contours of the regularization term in (9.18) for various values of the parameter $q$.

A regularizer of the form (9.18) with $q = 1$ is known as a lasso in the statistics
literature (Tibshirani, 1996). For quadratic error functions, it has the property that
if $\lambda$ is sufficiently large, some of the coefficients $w_j$ are driven to zero, leading to a
sparse model in which the corresponding basis functions play no role. To see this,
note that minimizing the regularized error function

$$E(\mathbf{w}) + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q \tag{9.19}$$

is equivalent to minimizing the unregularized error function $E(\mathbf{w})$ subject to the
constraint

$$\sum_{j=1}^{M} |w_j|^q \leqslant \eta \tag{9.20}$$

for an appropriate value of the parameter $\eta$, where the two approaches can be related
using Lagrange multipliers. The origin of the sparsity can be seen in Figure 9.6,
which shows the minimum of the error function, subject to the constraint (9.20). As
$\lambda$ is increased, more parameters will be driven to zero. By comparison, a quadratic
regularizer leaves both weight parameters with non-zero values.
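
The contrast between the two regularizers can be seen in closed form in a simplified setting. The sketch below assumes a quadratic error $\frac{1}{2} \sum_j h_j (w_j - w_j^*)^2$ whose Hessian is diagonal, so that each weight can be minimized independently; this assumption, the function names, and the numerical values are for illustration only. With $q = 1$ the minimizer is a soft-thresholding of $w_j^*$, whereas with $q = 2$ it is a rescaling.

```python
import numpy as np

def quadratic_solution(w_star, h, lam):
    """Minimizer of 0.5 * sum_j h_j (w_j - w*_j)^2 + (lam / 2) * sum_j w_j^2."""
    return h / (h + lam) * w_star

def lasso_solution(w_star, h, lam):
    """Minimizer of 0.5 * sum_j h_j (w_j - w*_j)^2 + (lam / 2) * sum_j |w_j|."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - lam / (2.0 * h), 0.0)

w_star = np.array([0.3, 2.0])   # unregularized minimum
h = np.array([1.0, 1.0])        # diagonal Hessian entries
lam = 1.0

w_q2 = quadratic_solution(w_star, h, lam)   # both weights shrink but stay non-zero
w_q1 = lasso_solution(w_star, h, lam)       # the small weight is set exactly to zero
print(w_q2, w_q1)
```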

Regularization allows complex models to be trained on data sets of limited size 
without severe over-fitting, essentially by limiting the effective model complexity. 
However, the problem of determining the optimal model complexity is then shifted 
from one of finding the appropriate number of learnable parameters to one of determining
a suitable value of the regularization coefficient $\lambda$. We will discuss the issue
of model complexity in the next section. 

Figure 9.6 Plot of the contours of the unregularized error function (red) along with the constraint region (9.20)
for the lasso regularizer $q = 1$ on the left, and the quadratic regularizer $q = 2$ on the right, in which the optimum
value for the parameter vector $\mathbf{w}$ is denoted by $\widehat{\mathbf{w}}$. The lasso gives a sparse solution in which $\widehat{w}_1 = 0$, whereas
the quadratic regularizer simply reduces $w_1$ to a smaller value.


9.3. Learning Curves 


We have already explored how the generalization performance of a model varies as 
we change the number of parameters in the model, the size of the data set, and the 
coefficient of a weight-decay regularization term. Each of these allows for a trade- 
off between bias and variance to minimize the generalization error. Another factor 
that influences this trade-off is the learning process itself. During optimization of 
the error function through gradient descent, the training error typically decreases 
as the model parameters are updated, whereas the error for hold-out data may be 
non-monotonic. This behaviour can be visualized using learning curves, which plot 
performance measures such as training set and validation set error as a function of 
iteration number during an iterative learning process such as stochastic gradient de- 
scent. These curves provide insight into the progress of training and also offer a 
practical methodology for controlling the effective model complexity. 


9.3.1 Early stopping 


An alternative to regularization as a way of controlling the effective complexity
of a network is early stopping. The training of deep learning models involves an
iterative reduction of the error function defined with respect to a set of training data.
Although the error function evaluated using the training set often shows a broadly
monotonic decrease as a function of the iteration number, the error measured with
respect to held-out data, generally called a validation set, often shows a decrease at
first, followed by an increase as the network starts to over-fit. Therefore, to obtain
a network with good generalization performance, training should be stopped at the
point of smallest error with respect to the validation data set, as indicated in Figure 9.7.

Figure 9.7 An illustration of the behaviour of training set error (left) and validation set error (right) during a
typical training session, as a function of the iteration step, for the sinusoidal data set. To achieve the best
generalization performance, the training should be stopped at the point shown by the vertical dashed lines,
corresponding to the minimum of the validation set error.

This behaviour of the learning curves is sometimes explained qualitatively in 
terms of the effective number of parameters in the network. This number starts out 
small and then grows during training, corresponding to a steady increase in the ef- 
fective complexity of the model. Stopping training before a minimum of the training 
error has been reached is a way to limit the effective network complexity. 
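
A minimal early-stopping loop might look like the following. It is a sketch under several assumptions that are not part of the text: a polynomial model of order 9 fitted by batch gradient descent to noisy sinusoidal data, with the data sizes, noise level, learning rate, and number of steps chosen arbitrarily. The loop records the parameters at the step with the smallest validation error seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, order=9):
    """Polynomial basis functions 1, x, ..., x^order."""
    return np.vander(x, order + 1, increasing=True)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)

(x_tr, t_tr), (x_val, t_val) = make_data(15), make_data(50)
Phi_tr, Phi_val = features(x_tr), features(x_val)

w = np.zeros(Phi_tr.shape[1])
lr, num_steps = 0.1, 5000
best_err, best_step, best_w = np.inf, 0, w.copy()
val_history = []
for step in range(num_steps):
    grad = Phi_tr.T @ (Phi_tr @ w - t_tr) / len(t_tr)   # gradient of the training error
    w = w - lr * grad
    val_err = np.mean((Phi_val @ w - t_val) ** 2)       # error on held-out data
    val_history.append(val_err)
    if val_err < best_err:                              # keep the best weights so far
        best_err, best_step, best_w = val_err, step, w.copy()

print(f"best validation error {best_err:.4f} at step {best_step}")
```

In practice a 'patience' criterion is often added so that training halts once the validation error has failed to improve for a fixed number of steps, rather than running to a preset limit.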

We can verify this insight for a quadratic error function and show that early stop- 
ping should exhibit similar behaviour to regularization using a simple weight-decay 
term (Bishop, 1995a). This can be understood from Figure 9.8, in which the axes 
in weight space have been rotated to be parallel to the eigenvectors of the Hessian 
matrix. If, in the absence of weight decay, the weight vector starts at the origin and 
proceeds during training along a path that follows the local negative gradient vector,
then the weight vector will move initially parallel to the $w_2$ axis through a point
corresponding roughly to $\widetilde{\mathbf{w}}$ and then move towards the minimum of the error function
$\mathbf{w}_{\mathrm{ML}}$. This follows from the shape of the error surface and the widely differing
eigenvalues of the Hessian. Stopping at a point near $\widetilde{\mathbf{w}}$ is therefore similar to weight
decay. The relationship between early stopping and weight decay can be made quantitative,
thereby showing that the quantity $\tau\eta$ (where $\tau$ is the iteration index and $\eta$ is
the learning rate parameter) acts like the reciprocal of the regularization parameter $\lambda$.
The effective number of parameters in the network therefore grows during training. 
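
The correspondence can be illustrated for a quadratic error with a diagonal Hessian. Starting from the origin, gradient descent gives $w_j^{(\tau)} = [1 - (1 - \eta h_j)^{\tau}]\, w_j^*$ after $\tau$ steps, which can be compared with the weight-decay solution $h_j w_j^* / (h_j + \lambda)$ evaluated at $\lambda = 1/(\tau\eta)$. The eigenvalues, learning rate, and number of steps below are assumed values for illustration, and the agreement between the two solutions is qualitative rather than exact.

```python
import numpy as np

h = np.array([0.01, 10.0])        # Hessian eigenvalues: flat along w_1, steep along w_2
w_star = np.array([1.0, 1.0])     # minimum of the unregularized error
lr, num_steps = 0.05, 20

w = np.zeros(2)                   # start at the origin
for _ in range(num_steps):
    w = w - lr * h * (w - w_star)                 # gradient step on the quadratic error

closed_form = (1.0 - (1.0 - lr * h) ** num_steps) * w_star
lam = 1.0 / (num_steps * lr)                      # tau * eta acts like 1 / lambda
weight_decay = h / (h + lam) * w_star

print("early stopping:", w)
print("weight decay:  ", weight_decay)
```

In both cases the weight along the steep direction is close to its unregularized value, while the weight along the flat direction remains near zero.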

Figure 9.8 A schematic illustration of why early stopping can give similar results to weight decay for a
quadratic error function. The ellipses show contours of constant error, and $\mathbf{w}^*$ denotes the
maximum likelihood solution corresponding to the minimum of the unregularized error function. If the
weight vector starts at the origin and moves according to the local negative gradient direction, then it
will follow the path shown by the curve. By stopping training early, a weight vector $\widetilde{\mathbf{w}}$ is found that is
qualitatively like that obtained with a simple weight-decay regularizer along with training to the minimum
of the regularized error, as can be seen by comparing with Figure 9.3.


9.3.2 Double descent 


The bias-variance trade-off (Section 4.3) provides insight into the generalization performance
of a learnable model as the number of parameters in the model is varied. Models with 
too few parameters will have a high test set error due to the limited representational 
capacity (high bias), and as the number of parameters increases, the test error is ex- 
pected to fall. However, as the number of parameters is increased further, the test 
error increases again due to over-fitting (high variance). This leads to the conven- 
tional belief, widespread in classical statistics, that the number of parameters in the 
model needs to be limited according to the size of the data set and that for a given 
training data set, very large models are expected to have poor performance. 

Contrary to this expectation, however, modern deep neural networks can have 
excellent performance even when the number of parameters far exceeds that required 
to achieve a perfect fit to the training data (Zhang et al., 2016), and the general 
wisdom in the deep learning community is that bigger models are better. Although 
early stopping is sometimes used, models may also be trained to zero error and yet 
still have good performance on test data. 

These seemingly contradictory perspectives can be reconciled by examining 
learning curves and other plots of generalization performance versus model com- 
plexity, which reveal a more subtle phenomenon called double descent (Belkin et 
al., 2019). This is illustrated in Figure 9.9, which shows training set and test set er- 
rors versus model complexity, as determined by the number of learnable parameters, 
for a large neural network called ResNet18 (He et al., 2015a), which has 18 layers 
of parameters trained on an image classification task. The number of weights and 
biases in the network is varied by changing the ‘width parameter’, which governs 
the number of hidden units in each layer. We see that the training error decreases 
monotonically with increasing complexity of the model, as expected. However, the 
test set error decreases at first, then increases again, and then finally decreases again. 
This reduction in test set error for very large models continues even after the training 
set error has reached zero. 

Figure 9.9 Plot of training set and test set errors for a large neural network model called ResNet18 trained on an 
image classification problem versus the complexity of the model. The horizontal axis represents a hyperparameter 
governing the number of hidden units and hence the overall number of weights and biases in the network. The 
vertical dashed line, labelled ‘interpolation threshold’, indicates the level of model complexity at which the model 
is capable, in principle, of achieving zero error on the training set. [From Nakkiran et al. (2019) with permission.] 

This surprising behaviour is more complex than we would expect from the usual 
bias-variance discussion of classical statistics and exhibits two different regimes of 
model fitting, as shown schematically in Figure 9.9, corresponding to the classical 
bias-variance trade-off for small to medium complexity, followed by a further reduction 
in test set error as we enter a regime of very large models. The transition 
between the two regimes occurs roughly when the number of parameters in the model 
is sufficiently large that the model is able to fit the training data exactly (Belkin et al., 
2019). Nakkiran et al. (2019) define the effective model complexity to be the maxi- 
mum size of training data set on which a model can achieve zero training error, and 
so double descent arises when the effective model complexity exceeds the number 
of data points in the training set. 

We see similar behaviour if we control model complexity using early stopping, 
as seen in Figure 9.10. Increasing the number of training epochs increases the ef- 
fective model complexity, and for a sufficiently large model, double descent is again 
observed. For such models there are many possible solutions including those that 
over-fit to the data. It therefore seems to be a property of stochastic gradient descent 
that the implicit biases that it introduces lead to good generalization performance. 

Analogous results are also obtained when a regularization term in the error function 
is used to control complexity. Here the test set error of a large model trained 
to convergence shows double descent with respect to $1/\lambda$, the inverse regularization 
parameter, since high $\lambda$ corresponds to low complexity (Yilmaz and Heckel, 2022). 

Figure 9.10 Plot of test set error versus number of epochs of gradient descent training for ResNet18 models 
of various sizes. The effective model complexity increases with the number of training epochs, and the double 
descent phenomenon is observed for a sufficiently large model. [From Nakkiran et al. (2019) with permission.] 

One ironic consequence of double descent is that it is possible to operate in a 
regime where increasing the size of the training data set could actually reduce per- 
formance, contrary to the conventional view that more data is always a good thing. 
For a model in the critical regime shown in Figure 9.9, an increase in the size of the 
training set can push the interpolation threshold to the right, leading to a higher test 
set error. This is confirmed in Figure 9.11, which shows the test set error for a trans- 
former model as a function of the dimensionality of the input space, known as the 
embedding dimension. Increasing the embedding dimension increases the number 
of weights and biases in the model and hence increases the model complexity. We 
see that increasing the training set size from 4,000 to 18,000 data points leads to a 
curve that is overall much lower. However, for a range of embedding dimensions 
that correspond to models in the critical complexity regime, increasing the size of 
the data set can actually reduce generalization performance. 


9.4. Parameter Sharing 


Figure 9.11 Plot of test set error for a large transformer model versus the embedding dimension, which controls 
the number of parameters in the model. Increasing the size of the training set from 4,000 to 18,000 samples 
generally leads to a lower test set error, but for some intermediate values of model complexity, there can be an 
increase in the error, as shown by the vertical red arrows. [From Nakkiran et al. (2019) with permission.] 

Regularization terms, such as the $L_2$ regularizer $\|\mathbf{w}\|^2$, help to reduce over-fitting by 
encouraging weight values to be close to zero. Another way to reduce network complexity 
is to impose hard constraints on the weights by forming them into groups and 
requiring that all weights within each group share the same value, with the shared 
value learned from data. This is known as weight sharing or parameter sharing or 
parameter tying. It means that the number of degrees of freedom is smaller than the 
number of connections in the network. Usually this is introduced as a way to encode 
inductive bias into a network to express some known invariances. Evaluating the 
error function gradients for such networks can be done using a small modification 
to backpropagation, although in practice this is handled implicitly through automatic 
differentiation. We will make extensive use of parameter sharing when we discuss 
convolutional neural networks. Parameter sharing is applicable, however, only to 
particular problems in which the form of the constraints can be specified in advance. 
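The small modification to backpropagation mentioned above amounts to summing, for each shared value, the derivatives with respect to every connection that uses it, which follows from the chain rule. The sketch below checks this for a single linear unit with a sum-of-squares error. The grouping array, the toy model, and all function names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical example: six connections tied into two groups, so the model
# has only two degrees of freedom even though it has six connections.
group_of_connection = np.array([0, 1, 0, 1, 0, 1])
shared_values = np.array([0.5, -0.3])


def expand(shared_values):
    """Map the shared values onto the individual connections."""
    return shared_values[group_of_connection]


def error(shared_values, x, t):
    """Sum-of-squares error for a single linear unit y = w^T x."""
    y = expand(shared_values) @ x
    return 0.5 * (y - t) ** 2


def gradient_wrt_shared(shared_values, x, t):
    """Chain rule: add up the per-connection derivatives within each group."""
    weights = expand(shared_values)
    per_connection = (weights @ x - t) * x   # dE/dw_i for each connection
    return np.bincount(group_of_connection,
                       weights=per_connection,
                       minlength=len(shared_values))


x = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])
t = 1.0
print(gradient_wrt_shared(shared_values, x, t))
```

An automatic differentiation framework would be expected to produce the same sums without special handling, because a shared value simply appears several times in the computational graph.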


9.4.1 Soft weight sharing 


Instead of using a hard constraint that forces sets of model parameters to be 
equal, Nowlan and Hinton (1992) introduced a form of soft weight sharing in which 
a regularization term encourages groups of weights to have similar values. Further- 
more, the division of weights into groups, the mean weight value for each group, 
and the spread of values within the groups are all determined as part of the learning 
process. 

Recall that the simple weight-decay regularizer in (9.1) can be viewed as the 
negative log of a Gaussian prior distribution over the weights. This encourages all 
the weights to converge towards a single value of zero. We can instead encourage the 
weight values to form several groups, rather than just one group, by considering a 
probability distribution that is a mixture of Gaussians. The means $\{\mu_j\}$ and variances 
$\{\sigma_j^2\}$ of the Gaussian components, as well as the mixing coefficients $\{\pi_j\}$, will be 
considered as adjustable parameters to be determined as part of the learning process. 
Thus, we have a probability density of the form 

$$p(\mathbf{w}) = \prod_i \left\{ \sum_{j=1}^{K} \pi_j \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right\} \tag{9.21}$$

where $K$ is the number of components in the mixture. Taking the negative logarithm 
then leads to a regularization function of the form 

$$\Omega(\mathbf{w}) = -\sum_i \ln \left( \sum_{j=1}^{K} \pi_j \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right). \tag{9.22}$$

The total error function is then given by 

$$\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \Omega(\mathbf{w}) \tag{9.23}$$

where $\lambda$ is the regularization coefficient. 

This error is minimized jointly with respect to the weights $\{w_i\}$ and with respect 
to the parameters $\{\pi_j, \mu_j, \sigma_j\}$ of the mixture model. This can be done using gradient 
descent, which requires that we evaluate the derivatives of $\Omega(\mathbf{w})$ with respect to all 
the learnable parameters. To do this, it is convenient to regard the $\{\pi_j\}$ as prior 
probabilities for each component to have generated a weight value, and to introduce 
the corresponding posterior probabilities, which are given by Bayes’ theorem: 

$$\gamma_j(w_i) = \frac{\pi_j \mathcal{N}(w_i \mid \mu_j, \sigma_j^2)}{\sum_k \pi_k \mathcal{N}(w_i \mid \mu_k, \sigma_k^2)}. \tag{9.24}$$


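The derivatives of the total error function with respect to the weights are then given 
by 

$$\frac{\partial \widetilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda \sum_j \gamma_j(w_i) \frac{(w_i - \mu_j)}{\sigma_j^2}. \tag{9.25}$$
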
The effect of the regularization term is therefore to pull each weight towards the 
centre of the jth Gaussian, with a force proportional to the posterior probability of 
that Gaussian for the given weight. This is precisely the kind of effect that we are 
seeking. 
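To make the regularizer and its gradient concrete, the sketch below evaluates $\Omega(\mathbf{w})$ from (9.22), the posterior probabilities from (9.24), and the regularization part of the weight derivative in (9.25), taking $\lambda = 1$. The numerical values and function names are illustrative assumptions.

```python
import numpy as np


def weighted_densities(w, pi, mu, var):
    """Array with entry [i, j] = pi_j * N(w_i | mu_j, var_j)."""
    diff = w[:, None] - mu
    return pi * np.exp(-0.5 * diff ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def regularizer(w, pi, mu, var):
    """Omega(w): negative log of the mixture density, summed over weights."""
    return -np.sum(np.log(weighted_densities(w, pi, mu, var).sum(axis=1)))


def responsibilities(w, pi, mu, var):
    """Posterior probability that component j generated weight w_i."""
    dens = weighted_densities(w, pi, mu, var)
    return dens / dens.sum(axis=1, keepdims=True)


def regularizer_grad_w(w, pi, mu, var):
    """d Omega / d w_i = sum_j gamma_j(w_i) * (w_i - mu_j) / var_j."""
    gamma = responsibilities(w, pi, mu, var)
    return np.sum(gamma * (w[:, None] - mu) / var, axis=1)


# Illustrative values: five weights and a three-component mixture.
w = np.array([-1.1, -0.9, 0.05, 1.0, 1.2])
pi = np.array([0.3, 0.4, 0.3])
mu = np.array([-1.0, 0.0, 1.0])
var = np.array([0.1, 0.05, 0.2])

print("responsibilities:\n", responsibilities(w, pi, mu, var))
print("gradient of regularizer:", regularizer_grad_w(w, pi, mu, var))
```

Each weight is dominated by the component whose centre lies nearest to it, so the gradient mostly pulls that weight towards that centre.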

Derivatives of the error with respect to the centres of the Gaussians are also 
easily computed to give 

$$\frac{\partial \widetilde{E}}{\partial \mu_j} = \lambda \sum_i \gamma_j(w_i) \frac{(\mu_j - w_i)}{\sigma_j^2} \tag{9.26}$$

which has a simple intuitive interpretation, because it pushes $\mu_j$ towards an average 
of the weight values, weighted by the posterior probabilities that the respective 
weight parameters were generated by component $j$. 

To ensure that the variances $\{\sigma_j^2\}$ remain positive, we introduce new variables 
$\{\xi_j\}$ defined by 

$$\sigma_j^2 = \exp(\xi_j) \tag{9.27}$$

and an unconstrained minimization is performed with respect to the $\{\xi_j\}$. The associated 
derivatives are then given by 

$$\frac{\partial \widetilde{E}}{\partial \xi_j} = \frac{\lambda}{2} \sum_i \gamma_j(w_i) \left( 1 - \frac{(w_i - \mu_j)^2}{\sigma_j^2} \right). \tag{9.28}$$

This process drives $\sigma_j^2$ towards a weighted average of the squared deviations of the 
weights around the corresponding centre $\mu_j$, where the weighting coefficients are 
again given by the posterior probability that each weight is generated by component $j$. 

For the derivatives with respect to the mixing coefficients $\pi_j$, we need to take 
account of the constraints 

$$\sum_j \pi_j = 1, \qquad 0 \leqslant \pi_j \leqslant 1, \tag{9.29}$$

which follow from the interpretation of the $\pi_j$ as prior probabilities. This can be 
done by expressing the mixing coefficients in terms of a set of auxiliary variables 
$\{\eta_j\}$ using the softmax function given by 

$$\pi_j = \frac{\exp(\eta_j)}{\sum_{k=1}^{K} \exp(\eta_k)}. \tag{9.30}$$

The derivatives of the regularized error function with respect to the $\{\eta_j\}$ then take 
the form 

$$\frac{\partial \widetilde{E}}{\partial \eta_j} = \lambda \sum_i \left\{ \pi_j - \gamma_j(w_i) \right\}. \tag{9.31}$$

We see that $\pi_j$ is therefore driven towards the average posterior probability for mixture 
component $j$. 
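The derivatives with respect to the mixture parameters can be checked numerically. The sketch below, which assumes $\lambda = 1$ and uses illustrative values, evaluates the regularizer in terms of the unconstrained variables $\xi_j$ and $\eta_j$ and returns its derivatives with respect to the centres, the log-variances, and the softmax variables. Function names are hypothetical.

```python
import numpy as np


def unpack(xi, eta):
    """Map unconstrained variables to variances and mixing coefficients."""
    var = np.exp(xi)
    exp_eta = np.exp(eta - np.max(eta))   # shift for numerical stability
    return var, exp_eta / exp_eta.sum()


def weighted_densities(w, mu, xi, eta):
    """Array with entry [i, j] = pi_j * N(w_i | mu_j, sigma_j^2)."""
    var, pi = unpack(xi, eta)
    diff = w[:, None] - mu
    return pi * np.exp(-0.5 * diff ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def regularizer(w, mu, xi, eta):
    """Omega(w) as a function of the unconstrained mixture parameters."""
    return -np.sum(np.log(weighted_densities(w, mu, xi, eta).sum(axis=1)))


def mixture_gradients(w, mu, xi, eta):
    """Derivatives of Omega with respect to mu, xi and eta (lambda = 1)."""
    var, pi = unpack(xi, eta)
    dens = weighted_densities(w, mu, xi, eta)
    gamma = dens / dens.sum(axis=1, keepdims=True)   # posterior probabilities
    diff = w[:, None] - mu
    grad_mu = np.sum(gamma * (-diff) / var, axis=0)
    grad_xi = 0.5 * np.sum(gamma * (1.0 - diff ** 2 / var), axis=0)
    grad_eta = np.sum(pi - gamma, axis=0)
    return grad_mu, grad_xi, grad_eta


# Illustrative values: five weights and a three-component mixture.
w = np.array([-1.1, -0.9, 0.05, 1.0, 1.2])
mu = np.array([-1.0, 0.0, 1.0])
xi = np.log(np.array([0.1, 0.05, 0.2]))
eta = np.array([0.2, -0.1, 0.4])

for name, grad in zip(("mu", "xi", "eta"), mixture_gradients(w, mu, xi, eta)):
    print(f"d Omega / d {name}:", grad)
```

Working with $\xi_j$ and $\eta_j$ means that a plain gradient step can never produce a negative variance or mixing coefficients that fail to sum to one.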

A different application of soft weight sharing (Lasserre, Bishop, and Minka, 
2006) introduces a principled approach that combines the unsupervised training of 
a generative model with the supervised training of a corresponding discriminative 
model. This is useful in situations where we have a significant amount of unlabelled 
data but where labelled data is in short supply. The generative model has the advan- 
tage that all of the data can be used to determine its parameters, whereas only the 
labelled examples directly inform the parameters of the discriminative model. How- 
ever, a discriminative model can achieve better generalization when there is model 
mis-specification, in other words when the model does not exactly describe the true 
distribution that generates the data, as is typically the case. By introducing a soft 
tying of the parameters of the two models, we obtain a well-defined hybrid of gen- 
erative and discriminative approaches that can be robust to model mis-specification 
while also benefiting from being trained on unlabelled data. 


9.5. Residual Connections 


Figure 9.12 Plots of the Jacobian for networks with a single input and a single output, showing (a) a network 
with two layers of weights, (b) a network with 25 layers of weights, and (c) a network with 51 layers of weights 
together with residual connections. [From Balduzzi et al. (2017) with permission.] 

The representational power of deep neural networks stems in large part from the 
use of multiple layers of processing, and it has been observed that increasing the 
number of layers in a network can increase generalization performance significantly. 
We have also seen how batch normalization, along with careful initialization of the 
weights and biases, can help address the problem of vanishing or exploding gradients 
in deep networks. However, even with batch normalization, it becomes increasingly 
difficult to train networks with a large number of layers. 

One explanation for this phenomenon is called shattered gradients (Balduzzi et 
al., 2017). We have seen that the representational capabilities of neural networks 
increase exponentially with depth. With ReLU activation functions, there is an ex- 
ponential increase in the number of linear regions that the network can represent. 
However, a consequence of this is a proliferation of discontinuities in the gradient 
of the error function. This is illustrated for networks with a single input variable 
and a single output variable in Figure 9.12. Here the derivative of the output vari- 
able with respect to the input variable (the Jacobian of the network) is plotted as 
a function of the input variable. From the chain rule of calculus, these derivatives 
determine the gradients of the error function surface. We see that for deep networks, 
extremely small changes in the weight parameters in the early layers of the network 
can produce significant changes in the gradient. Iterative gradient-based optimiza- 
tion algorithms assume that the gradient varies smoothly across parameter space, 
and hence this ‘shattered gradient’ effect can render training ineffective in very deep 
networks. 

Figure 9.13 A residual network consisting of three residual blocks, corresponding to the sequence of transformations 
(9.35) to (9.37). 

An important modification to the architecture of neural networks that greatly assists 
in training very deep networks is that of residual connections (He et al., 2015a), 
which are a particular form of skip-layer connections. Consider a neural network 
that consists of a sequence of three layers of the form 

$$\mathbf{z}_1 = \mathbf{F}_1(\mathbf{x}) \tag{9.32}$$
$$\mathbf{z}_2 = \mathbf{F}_2(\mathbf{z}_1) \tag{9.33}$$
$$\mathbf{y} = \mathbf{F}_3(\mathbf{z}_2). \tag{9.34}$$


Here the functions $\mathbf{F}_l(\cdot)$ might simply consist of a linear transformation followed 
by a ReLU activation function or they might be more complex with multiple linear, 
activation function, and normalization layers. A residual connection consists simply 
of adding the input to each function back onto the output to give 


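$$\mathbf{z}_1 = \mathbf{F}_1(\mathbf{x}) + \mathbf{x} \tag{9.35}$$
$$\mathbf{z}_2 = \mathbf{F}_2(\mathbf{z}_1) + \mathbf{z}_1 \tag{9.36}$$
$$\mathbf{y} = \mathbf{F}_3(\mathbf{z}_2) + \mathbf{z}_2. \tag{9.37}$$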


Each combination of a function and a residual connection, such as $\mathbf{F}_1(\mathbf{x}) + \mathbf{x}$, is 
called a residual block. A residual network, also known as a ResNet, consists of 
multiple layers of such blocks in sequence. A modified network with residual con- 
nections is illustrated in Figure 9.13. A residual block can easily generate the identity 
transformation, if the parameters in the nonlinear function are small enough for the 
function outputs to become close to zero. 
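A minimal sketch of the three-block residual network is shown below, taking each learnable function to be a linear transformation followed by a ReLU. The layer form, the Gaussian initialization, and the function names are assumptions for illustration. The printed values indicate how far the network output is from the input when the parameters are small and when they are not.

```python
import numpy as np


def layer(params, z):
    """One learnable function F_l: a linear transformation followed by ReLU."""
    weights, bias = params
    return np.maximum(0.0, weights @ z + bias)


def residual_network(blocks, x):
    """Apply z_l = F_l(z_{l-1}) + z_{l-1} for each block in sequence."""
    z = x
    for params in blocks:
        z = layer(params, z) + z
    return z


def random_blocks(rng, num_blocks, dim, scale):
    """Hypothetical initializer: Gaussian parameters with standard deviation `scale`."""
    return [(scale * rng.normal(size=(dim, dim)), scale * rng.normal(size=dim))
            for _ in range(num_blocks)]


rng = np.random.default_rng(0)
x = rng.normal(size=4)

small = random_blocks(rng, num_blocks=3, dim=4, scale=1e-3)
large = random_blocks(rng, num_blocks=3, dim=4, scale=1.0)

print("deviation from identity, small parameters:",
      np.max(np.abs(residual_network(small, x) - x)))
print("deviation from identity, large parameters:",
      np.max(np.abs(residual_network(large, x) - x)))
```

With all parameters exactly zero, each block returns its input unchanged, so the whole network reduces to the identity transformation.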

The term ‘residual’ refers to the fact that in each block the function learns the 
residual between the identity map and the desired output, which we can see by rear- 
ranging the residual transformation: 


$$\mathbf{F}_l(\mathbf{z}_{l-1}) = \mathbf{z}_l - \mathbf{z}_{l-1}. \tag{9.38}$$


The gradients in a network with residual connections are much less sensitive to input 
values compared to a standard deep network, as seen in Figure 9.12(c). 

Li et al. (2017) developed a way to visualize error surfaces directly, which 
showed that the effect of the residual connections is to create smoother error function 
surfaces, as shown in Figure 9.14. It is usual to include batch normalization layers 
in a residual network, as together they significantly reduce the issue of vanishing 
and exploding gradients. He et al. (2015a) showed that including residual connec- 
tions allows very deep networks, potentially having hundreds of layers, to be trained 
effectively. 

Figure 9.14 (a) A visualization of the error surface for a network with 56 layers. (b) The same network with the 
inclusion of residual connections, showing the smoothing effect that comes from the residual connections. [From 
Li et al. (2017) with permission.] 

Further insight into the way residual connections encourage smooth error surfaces 
can be obtained if we combine (9.35), (9.36), and (9.37) to give a single overall 
equation for the whole network: 

$$\mathbf{y} = \mathbf{F}_3(\mathbf{F}_2(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}) + \mathbf{z}_1) + \mathbf{z}_2. \tag{9.39}$$

We can now substitute for the intermediate variables $\mathbf{z}_1$ and $\mathbf{z}_2$ to give an expression 
for the network output as a function of the input $\mathbf{x}$: 

$$\mathbf{y} = \mathbf{F}_3(\mathbf{F}_2(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}) + \mathbf{F}_1(\mathbf{x}) + \mathbf{x}) + \mathbf{F}_2(\mathbf{F}_1(\mathbf{x}) + \mathbf{x}) + \mathbf{F}_1(\mathbf{x}) + \mathbf{x}. \tag{9.40}$$


This expanded form of the residual network is depicted in Figure 9.15. We see that 
the overall function consists of multiple networks acting in parallel and that these 
include networks with fewer layers. The network has the representational capability 
of a deep network, since it contains such a network as a special case. However, the 
error surface is moderated by a combination of shallow and deep sub-networks. 
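The equivalence between the sequential form (9.35) to (9.37) and the expanded form (9.40) can be checked numerically. In the sketch below the three functions are hypothetical ReLU layers with random parameters, chosen only for illustration, and the expanded form is written as an explicit sum of a deep path, a medium path, a shallow path, and the input itself.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 3


def make_layer(rng, dim):
    """Return a hypothetical learnable function F(z) = ReLU(W z + b)."""
    weights = rng.normal(size=(dim, dim))
    bias = rng.normal(size=dim)
    return lambda z: np.maximum(0.0, weights @ z + bias)


F1, F2, F3 = (make_layer(rng, dim) for _ in range(3))


def sequential_form(x):
    """Three residual blocks applied in turn, as in (9.35) to (9.37)."""
    z1 = F1(x) + x
    z2 = F2(z1) + z1
    return F3(z2) + z2


def expanded_form(x):
    """The same function written as a sum of parallel paths, as in (9.40)."""
    deep_path = F3(F2(F1(x) + x) + F1(x) + x)
    medium_path = F2(F1(x) + x)
    shallow_path = F1(x)
    return deep_path + medium_path + shallow_path + x


x = rng.normal(size=dim)
print(sequential_form(x))
print(expanded_form(x))
```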

Note that the skip-layer connections defined by (9.40) require the input and all 
the intermediate variables to have the same dimensionality so that they can be added. 
We can change the dimensionality at some point in the network by including a non- 
square matrix W of learnable parameters in the form 


$$\mathbf{z}_l = \mathbf{F}_l(\mathbf{z}_{l-1}) + \mathbf{W} \mathbf{z}_{l-1}. \tag{9.41}$$


So far we have not been specific about the form of the learnable nonlinear functions 
$\mathbf{F}_l(\cdot)$. The simplest choice would be a standard neural network that alternates 
between layers consisting of a learnable linear transformation and a fixed nonlinear 
activation function such as ReLU. This opens two possibilities for placing the resid- 
ual connections, as shown in Figure 9.16. In version (a) the quantities being added 
are always non-negative since they are given by the outputs of ReLU layers, and so 
to allow for both positive and negative values, version (b) is more commonly used. 

Figure 9.15 The same network as in Figure 9.13, shown here in expanded form. 


9.6. Model Averaging 


If we have several different models trained to solve the same problem then instead 
of trying to select the single best model, we can often improve generalization by 
averaging the predictions made by the individual models. Such combinations of 
models are sometimes called committees or ensembles. For models that produce 
probabilistic outputs, the predicted distribution is the average of the predictions from 
each model: 


$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{L} \sum_{l=1}^{L} p_l(\mathbf{y} \mid \mathbf{x}) \tag{9.42}$$

where $p_l(\mathbf{y} \mid \mathbf{x})$ is the output of model $l$ and $L$ is the total number of models. 

Figure 9.16 Two alternative ways to include residual network connections into a standard feed-forward network 
that alternates between learnable linear layers and nonlinear ReLU activation functions. 

This averaging process can be motivated by considering the trade-off between 
bias and variance. Recall from Figure 4.7 that when we trained multiple polynomials 
using the sinusoidal data and then averaged the resulting functions, the contribution 
arising from the variance term tended to cancel, leading to improved predictions. 

In practice, of course, we have only a single data set, and so we have to find 
a way to introduce variability between the different models within the committee. 
One approach is to use bootstrap data sets, in which multiple data sets are created as 
follows. Suppose our original data set consists of $N$ data points $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. 
We can create a new data set $\mathbf{X}_{\mathrm{B}}$ by drawing $N$ points at random from $\mathbf{X}$, with 
replacement, so that some points in $\mathbf{X}$ may be replicated in $\mathbf{X}_{\mathrm{B}}$, whereas other points 
in $\mathbf{X}$ may be absent from $\mathbf{X}_{\mathrm{B}}$. This process can be repeated $L$ times to generate $L$ 
data sets each of size $N$ and each obtained by sampling from the original data set $\mathbf{X}$. 
Each data set can then be used to train a model, and the predictions of the resulting 
models are averaged. This procedure is known as bootstrap aggregation or bagging 
(Breiman, 1996). An alternative approach to forming an ensemble is to use the 
original data set to train multiple different models having different architectures. 
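The bagging procedure can be sketched in a few lines. The example below uses polynomial regression on noisy sinusoidal data as the base model, which is an assumption made for illustration, as are the data set size, the polynomial degree, and the function names.

```python
import numpy as np


def bootstrap_indices(num_points, rng):
    """Draw N indices uniformly at random with replacement."""
    return rng.integers(0, num_points, size=num_points)


def bagged_polynomial_prediction(x_train, t_train, x_test, num_models, degree, rng):
    """Fit one polynomial per bootstrap data set and average the predictions."""
    predictions = []
    for _ in range(num_models):
        idx = bootstrap_indices(len(x_train), rng)
        coefficients = np.polyfit(x_train[idx], t_train[idx], degree)
        predictions.append(np.polyval(coefficients, x_test))
    return np.mean(predictions, axis=0)


rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=30)
t_train = np.sin(2.0 * np.pi * x_train) + 0.2 * rng.normal(size=30)
x_test = np.linspace(0.0, 1.0, 5)

print(bagged_polynomial_prediction(x_train, t_train, x_test,
                                   num_models=20, degree=3, rng=rng))
```

Because sampling is done with replacement, each bootstrap data set typically repeats some points and omits others, which is the source of variability between the fitted models.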

We can analyse the benefits of ensemble predictions by considering a regression 
problem with an input vector $\mathbf{x}$ and a single output variable $y$. Suppose we have a 
set of trained models $y_1(\mathbf{x}), \ldots, y_M(\mathbf{x})$, and we form a committee prediction given 
by 

$$y_{\mathrm{COM}}(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} y_m(\mathbf{x}). \tag{9.43}$$


If the true function that we are trying to predict is given by h(x), then the output of 
each of the models can be written as the true value plus an error: 


$$y_m(\mathbf{x}) = h(\mathbf{x}) + \epsilon_m(\mathbf{x}). \tag{9.44}$$


The average sum-of-squares error then takes the form 


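$$\mathbb{E}_{\mathbf{x}}\left[ \{ y_m(\mathbf{x}) - h(\mathbf{x}) \}^2 \right] = \mathbb{E}_{\mathbf{x}}\left[ \epsilon_m(\mathbf{x})^2 \right] \tag{9.45}$$

where $\mathbb{E}_{\mathbf{x}}[\cdot]$ denotes a frequentist expectation with respect to the distribution of the 
input vector $\mathbf{x}$. The average error made by the models acting individually is therefore 

$$E_{\mathrm{AV}} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{x}}\left[ \epsilon_m(\mathbf{x})^2 \right]. \tag{9.46}$$

Similarly, the expected error from the committee (9.43) is given by 

$$E_{\mathrm{COM}} = \mathbb{E}_{\mathbf{x}}\left[ \left\{ \frac{1}{M} \sum_{m=1}^{M} y_m(\mathbf{x}) - h(\mathbf{x}) \right\}^2 \right] = \mathbb{E}_{\mathbf{x}}\left[ \left\{ \frac{1}{M} \sum_{m=1}^{M} \epsilon_m(\mathbf{x}) \right\}^2 \right]. \tag{9.47}$$

If we assume that the errors have zero mean and are uncorrelated, so that 

$$\mathbb{E}_{\mathbf{x}}\left[ \epsilon_m(\mathbf{x}) \right] = 0 \tag{9.48}$$
$$\mathbb{E}_{\mathbf{x}}\left[ \epsilon_m(\mathbf{x}) \epsilon_l(\mathbf{x}) \right] = 0, \qquad m \neq l \tag{9.49}$$

then we obtain 

$$E_{\mathrm{COM}} = \frac{1}{M} E_{\mathrm{AV}}. \tag{9.50}$$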


This apparently dramatic result suggests that the average error of a model can be 
reduced by a factor of M simply by averaging M versions of the model. Unfortu- 
nately, it depends on the key assumption that the errors due to the individual models 
are uncorrelated. In practice, the errors are typically highly correlated, and the re- 
duction in the overall error is generally much smaller. It can, however, be shown that 
the expected committee error will not exceed the expected error of the constituent 
models, so that $E_{\mathrm{COM}} \leqslant E_{\mathrm{AV}}$. 
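A small simulation can illustrate both the idealized result (9.50) and its breakdown when errors are correlated. The sketch below draws synthetic error values directly rather than training any models. The Gaussian error model, the strength of the shared component, and the function name are assumptions for illustration.

```python
import numpy as np


def committee_errors(errors):
    """Return (E_AV, E_COM) for an array with errors[m, n] = eps_m(x_n)."""
    e_av = np.mean(errors ** 2)                     # average individual error
    e_com = np.mean(np.mean(errors, axis=0) ** 2)   # error of the averaged model
    return e_av, e_com


rng = np.random.default_rng(0)
num_models, num_inputs = 10, 100_000

# Case 1: zero-mean, uncorrelated errors, matching the stated assumptions.
independent = rng.normal(size=(num_models, num_inputs))

# Case 2: a shared error component makes the models' errors strongly correlated.
shared = rng.normal(size=(1, num_inputs))
correlated = 0.9 * shared + 0.1 * rng.normal(size=(num_models, num_inputs))

for name, errors in [("uncorrelated", independent), ("correlated", correlated)]:
    e_av, e_com = committee_errors(errors)
    print(f"{name}: E_AV = {e_av:.3f}, E_COM = {e_com:.3f}, ratio = {e_com / e_av:.3f}")
```

In the uncorrelated case the ratio should be close to $1/M$, whereas with a dominant shared component the committee error is only slightly below the average individual error.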

A somewhat different approach to model combination, known as boosting (Fre- 
und and Schapire, 1996), combines multiple ‘base’ classifiers to produce a form of 
committee whose performance can be significantly better than that of any of the base 
classifiers. Boosting can give good results even if the base classifiers perform only 
slightly better than random. The principal difference between boosting and the com- 
mittee methods, such as bagging as discussed above, is that the base classifiers are 
trained in sequence and each base classifier is trained using a weighted form of the 
data set in which the weighting coefficient associated with each data point depends 
on the performance of the previous classifiers. In particular, points that are misclas- 
sified by one of the base classifiers are given a greater weight when used to train 
the next classifier in the sequence. Once all the classifiers have been trained, their 
predictions are then combined through a weighted majority voting scheme. 

In practice, the major drawback of all model combination methods is that mul- 
tiple models have to be trained and then predictions have to be evaluated for all the 
models, thereby increasing the computational cost of both training and inference. 
How significant this is depends on the specific application scenario. 


9.6.1 Dropout 


A widely used and very effective form of regularization known as dropout (Sri- 
vastava et al., 2014) can be viewed as an implicit way to perform approximate model 
averaging over exponentially many models without having to train multiple models 
individually. It has broad applicability and is computationally cheap. Dropout is one 
of the most effective forms of regularization and is widely used in applications. 

The central idea of dropout is to delete nodes from the network, including their 
connections, at random during training. Each time a data point is presented to the 
network, a new random choice is made for which nodes to omit. Figure 9.17 shows 
a simple network along with examples of pruned networks in which subsets of nodes 
have been omitted. 

Dropout is applied to both hidden nodes and input nodes, but not outputs, and is 
equivalent to setting the output of a dropped node to zero. It can be implemented by 
defining a mask vector $R_i \in \{0, 1\}$ which multiplies the activation of the non-output 
node $i$ for data point $n$, whose values are set to 1 with probability $p$. A value of 
$p = 0.5$ seems to work well for the hidden nodes, whereas for the inputs a value of 
$p = 0.8$ is typically used. 

Figure 9.17 A neural network on the left along with two examples of pruned networks in which a random subset 
of nodes have been omitted. 

During training, as each data point is presented to the network, a new mask is 
created, and the forward and backward propagation steps are applied on that pruned 
network to create error function gradients, which are then used to update the weights, 
for example by stochastic gradient descent. If the data points are grouped into mini- 
batches then the gradients are averaged over the data points in each mini-batch before 
applying the weight update. For a network with $M$ non-output nodes, there are $2^M$ 
pruned networks, and so only a small fraction of these networks will ever be con- 
sidered during training. This differs from conventional ensemble methods in which 
each of the networks in the ensemble is independently trained to convergence. An- 
other difference is that the exponentially many networks that are implicitly being 
trained with dropout are not independent but share their parameter values with the 
full network, and hence with each other. Note that training can take longer with 
dropout since the individual parameter updates are very noisy. Also, because the 
error function is intrinsically noisy, it is harder to confirm that the optimization al- 
gorithm is working correctly just by looking for a decreasing error function during 
training. 
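The training procedure can be sketched for a small two-layer regression network as follows. The network architecture, the sum-of-squares error, the learning rate, and all function names are assumptions made for illustration. The keep probabilities follow the values quoted above for inputs and hidden nodes.

```python
import numpy as np


def sample_masks(rng, num_inputs, num_hidden, p_input=0.8, p_hidden=0.5):
    """Draw a fresh 0/1 mask for the input and hidden nodes (1 = node kept)."""
    r_in = (rng.random(num_inputs) < p_input).astype(float)
    r_hid = (rng.random(num_hidden) < p_hidden).astype(float)
    return r_in, r_hid


def loss_and_gradients(params, x, t, r_in, r_hid):
    """Forward and backward pass through the pruned network for one data point."""
    W1, b1, w2, b2 = params
    x_kept = r_in * x
    a = W1 @ x_kept + b1
    h_kept = r_hid * np.maximum(0.0, a)    # dropped nodes output zero
    y = w2 @ h_kept + b2
    err = y - t                            # derivative of 0.5 * (y - t)^2
    delta = err * w2 * r_hid * (a > 0)     # backpropagate through mask and ReLU
    grads = (np.outer(delta, x_kept), delta, err * h_kept, err)
    return 0.5 * err ** 2, grads


def dropout_sgd_step(params, x, t, rng, learning_rate=0.05):
    """One stochastic gradient descent update using a newly sampled mask."""
    r_in, r_hid = sample_masks(rng, len(x), len(params[1]))
    _, grads = loss_and_gradients(params, x, t, r_in, r_hid)
    return tuple(p - learning_rate * g for p, g in zip(params, grads))


rng = np.random.default_rng(0)
params = (rng.normal(size=(5, 3)), np.zeros(5), rng.normal(size=5), 0.0)
x, t = np.array([0.5, -1.0, 2.0]), 1.0
params = dropout_sgd_step(params, x, t, rng)
```

Weights attached to a dropped node receive zero gradient for that data point, so each update only adjusts the parameters of the pruned network that was actually sampled.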

Once training is complete, predictions can in principle be made by applying the 
ensemble rule (9.42), which in this case takes the form 


$$p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{R}} p(\mathbf{R})\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{R}) \tag{9.51}$$


where the sum is over the exponentially large space of masks, and p(y|x, R) is the 
predictive distribution from the network with mask R. Because this summation is 
intractable, it can be approximated by sampling a small number of masks, and in 
practice, as few as 10 or 20 masks can be sufficient to obtain good results. This 
procedure is known as Monte Carlo dropout. 
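A minimal sketch of Monte Carlo dropout for a toy regression network is given below, in which the ensemble average is taken over the network's point prediction rather than over a full predictive distribution. Because the toy network is small, the sum over all masks can also be enumerated exactly for comparison. The network, its random parameters, and the function names are assumptions made for illustration, and dropout is applied only to the hidden layer.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
num_inputs, num_hidden = 3, 4
p_keep = 0.5

# Hypothetical trained parameters of a small regression network.
W1 = rng.normal(size=(num_hidden, num_inputs))
w2 = rng.normal(size=num_hidden)


def predict_with_mask(x, mask):
    """Network output when hidden node i is kept only if mask[i] == 1."""
    hidden = np.maximum(0.0, W1 @ x)
    return w2 @ (mask * hidden)


def monte_carlo_dropout(x, num_masks, rng):
    """Approximate the ensemble average by sampling a few random masks."""
    masks = (rng.random((num_masks, num_hidden)) < p_keep).astype(float)
    return np.mean([predict_with_mask(x, mask) for mask in masks])


def exact_ensemble(x):
    """Sum over all 2^M masks, each weighted by its probability p(R)."""
    total = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=num_hidden):
        mask = np.array(bits)
        prob = np.prod(np.where(mask == 1.0, p_keep, 1.0 - p_keep))
        total += prob * predict_with_mask(x, mask)
    return total


x = np.array([1.0, -0.5, 2.0])
print("20 sampled masks:", monte_carlo_dropout(x, 20, rng))
print("all 16 masks:    ", exact_ensemble(x))
```

Exact enumeration is feasible here only because there are four hidden nodes. For realistic networks the number of masks is far too large, which is why sampling is used.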


An even simpler approach is to make predictions using the trained network with 
no nodes masked out, and to re-scale the weights in the network so that the expected 
input to each node is roughly the same during testing as it would be during training, 
compensating for the fact that in training a proportion of the nodes would be missing. 
Thus, if a node is present with probability $p$ during training, then during testing the 
output weights from that node would be multiplied by $p$ before using the network to 
make predictions. 

A different motivation for dropout comes from the Bayesian perspective (Section 2.6). In a 
fully Bayesian treatment, we would make predictions by averaging over all possible 
$2^M$ network models, with each network weighted by its posterior probability. Computationally, 
this would be prohibitively expensive, both during training when evaluating 
the posterior probabilities and during testing when computing the weighted 
predictions. Dropout approximates this model averaging by giving an equal weight 
to each possible model. 

Further intuition behind dropout comes from its role in reducing over-fitting. In 
a standard network, the parameters can become tuned to noise on individual data 
points, with hidden nodes becoming over-specialized. Each node adjusts its weights 
to minimize the error, given the outputs of other nodes, leading to co-adaptation of 
nodes in a way that might not generalize to new data. With dropout, each node 
cannot rely on the presence of other specific nodes and must instead make useful 
contributions in a broad range of contexts, thereby reducing co-adaptation and specialization. 
For a simple linear regression model trained using least squares, dropout 
regularization is equivalent to a modified form of quadratic regularization (Exercise 9.18). 


Exercises 


9.1 (⋆) By considering each of the four group axioms in turn, show that the set of all possible 
rotations of a square through (positive or negative) multiples of 90°, together 
with the binary operation of composing rotations, forms a group. Similarly, show 
that the set of all continuous translations of an object in a two-dimensional plane 
also forms a group. (See Section 9.1.3.) 


9.2 (⋆⋆) Consider a linear model of the form 

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i \tag{9.52}$$

together with a sum-of-squares error function of the form 

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2. \tag{9.53}$$

Now suppose that Gaussian noise $\epsilon_i$ with zero mean and variance $\sigma^2$ is added independently 
to each of the input variables $x_i$. By making use of $\mathbb{E}[\epsilon_i] = 0$ and 
$\mathbb{E}[\epsilon_i \epsilon_j] = \delta_{ij} \sigma^2$, show that minimizing $E_D$ averaged over the noise distribution is 
equivalent to minimizing the sum-of-squares error for noise-free input variables with 
the addition of a weight-decay regularization term, in which the bias parameter $w_0$ 
is omitted from the regularizer. 


9.3 (⋆⋆) Consider an error function that consists simply of the quadratic regularizer 

$$\Omega(\mathbf{w}) = \frac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w} \tag{9.54}$$

together with the gradient descent update formula 

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \eta \nabla \Omega(\mathbf{w}). \tag{9.55}$$

By considering the limit of infinitesimal updates, write down a corresponding differential 
equation for the evolution of $\mathbf{w}$. Write down the solution of this equation 
starting from an initial value $\mathbf{w}_0$, and show that the elements of $\mathbf{w}$ decay exponentially 
to zero. 


9.4 (⋆) Verify that the network function defined by (9.6) and (9.7) is invariant under 
the transformation (9.8) applied to the inputs, provided the weights and biases are 
simultaneously transformed using (9.9) and (9.10). Similarly, show that the network 
outputs can be transformed according to (9.11) by applying the transformation (9.12) 
and (9.13) to the second-layer weights and biases. 


9.5 (⋆⋆) By using Lagrange multipliers, show that minimizing the regularized error 
function given by (9.19) is equivalent to minimizing the unregularized error function 
$E(\mathbf{w})$ subject to the constraint (9.20). Discuss the relationship between the 
parameters $\eta$ and $\lambda$. 


9.6 (⋆⋆⋆) Consider a quadratic error function of the form 

$$E = E_0 + \frac{1}{2} (\mathbf{w} - \mathbf{w}^\star)^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^\star) \tag{9.56}$$

where $\mathbf{w}^\star$ represents the minimum, and the Hessian matrix $\mathbf{H}$ is positive definite and 
constant. Suppose the initial weight vector $\mathbf{w}^{(0)}$ is chosen to be at the origin and is 
updated using simple gradient descent: 

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \rho \nabla E \tag{9.57}$$

where $\tau$ denotes the step number, and $\rho$ is the learning rate (which is assumed to be 
small). Show that, after $\tau$ steps, the components of the weight vector parallel to the 
eigenvectors of $\mathbf{H}$ can be written 

$$w_j^{(\tau)} = \{ 1 - (1 - \rho \eta_j)^{\tau} \} w_j^\star \tag{9.58}$$

where $w_j = \mathbf{w}^{\mathrm{T}} \mathbf{u}_j$, and $\mathbf{u}_j$ and $\eta_j$ are the eigenvectors and eigenvalues of $\mathbf{H}$, respectively, 
defined by 

$$\mathbf{H} \mathbf{u}_j = \eta_j \mathbf{u}_j. \tag{9.59}$$

Show that as $\tau \to \infty$, this gives $\mathbf{w}^{(\tau)} \to \mathbf{w}^\star$ as expected, provided $|1 - \rho \eta_j| < 1$. 
Now suppose that training is halted after a finite number $\tau$ of steps. Show that the 
components of the weight vector parallel to the eigenvectors of the Hessian satisfy 

$$w_j^{(\tau)} \simeq w_j^\star \quad \text{when} \quad \eta_j \gg (\rho\tau)^{-1} \tag{9.60}$$
$$|w_j^{(\tau)}| \ll |w_j^\star| \quad \text{when} \quad \eta_j \ll (\rho\tau)^{-1}. \tag{9.61}$$

This result shows that $(\rho\tau)^{-1}$ plays an analogous role to the regularization parameter 
$\lambda$ in weight decay. 


9.7 (⋆⋆) Consider a neural network in which multiple weights are constrained to have the
same value. Discuss how the standard backpropagation algorithm must be modified 
to ensure that such constraints are satisfied when evaluating the derivatives of an 
error function with respect to the adjustable parameters in the network. 


9.8 (⋆) Consider a mixture distribution defined by

$$p(w) = \sum_{j} \pi_j \mathcal{N}(w \mid \mu_j, \sigma_j^2) \tag{9.62}$$

in which $\{\pi_j\}$ can be viewed as prior probabilities $p(j)$ for the corresponding Gaussian
components. Using Bayes’ theorem, show that the corresponding posterior
probabilities $p(j \mid w)$ are given by (9.24).


9.9 (⋆⋆) Using (9.21), (9.22), (9.23), and (9.24) verify the result (9.25).
9.10 (⋆⋆) Using (9.21), (9.22), (9.23), and (9.24) verify the result (9.26).
9.11 (⋆⋆) Using (9.21), (9.22), (9.23), and (9.24) verify the result (9.28).


9.12 (⋆⋆) Show that the derivatives of the mixing coefficients $\{\pi_k\}$ defined by (9.30) with
respect to the auxiliary parameters $\{\eta_j\}$ are given by

$$\frac{\partial \pi_k}{\partial \eta_j} = \delta_{jk} \pi_j - \pi_j \pi_k. \tag{9.63}$$

Hence, by making use of the constraint $\sum_k \gamma_k(w_i) = 1$ for all $i$, derive the result
(9.31).


9.13 (⋆) Verify that combining (9.35), (9.36), and (9.37) gives a single overall equation
for the whole network in the form (9.40). 


9.14 (⋆⋆) The expected sum-of-squares error $E_{\mathrm{AV}}$ for a simple committee model can be
defined by (9.46), and the expected error of the committee itself is given by (9.47).
Assuming that the individual errors satisfy (9.48) and (9.49), derive the result (9.50).


9.15 (⋆⋆) By making use of Jensen’s inequality (2.102) for the special case of the convex
function $f(x) = x^2$, show that the average expected sum-of-squares error $E_{\mathrm{AV}}$ of
the members of a simple committee model, given by (9.46), and the expected error
$E_{\mathrm{COM}}$ of the committee itself, given by (9.47), satisfy

$$E_{\mathrm{COM}} \leqslant E_{\mathrm{AV}}. \tag{9.64}$$


9.16 (⋆⋆) By making use of Jensen’s inequality (2.102), show that the result (9.64)
derived in the previous exercise holds for any error function $E(y)$, not just
sum-of-squares, provided it is a convex function of $y$.


9.17 (⋆⋆) Consider a committee in which we allow unequal weighting of the constituent
models, so that

$$y_{\mathrm{COM}}(\mathbf{x}) = \sum_{m=1}^{M} \alpha_m y_m(\mathbf{x}). \tag{9.65}$$

To ensure that the predictions $y_{\mathrm{COM}}(\mathbf{x})$ remain within sensible limits, suppose that
we require that they be bounded at each value of $\mathbf{x}$ by the minimum and maximum
values given by any of the members of the committee, so that

$$y_{\min}(\mathbf{x}) \leqslant y_{\mathrm{COM}}(\mathbf{x}) \leqslant y_{\max}(\mathbf{x}). \tag{9.66}$$

Show that a necessary and sufficient condition for this constraint is that the
coefficients $\alpha_m$ satisfy

$$\alpha_m \geqslant 0, \qquad \sum_{m=1}^{M} \alpha_m = 1. \tag{9.67}$$


9.18 (⋆⋆⋆) Here we explore the effect of dropout regularization on a simple linear
regression model trained using least squares. Consider a model of the form

$$y_k = \sum_{i=1}^{D} w_{ki} x_i \tag{9.68}$$

along with a sum-of-squares error function given by

$$E(\mathbf{W}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ y_{nk} - \sum_{i=1}^{D} w_{ki} R_{ni} x_{ni} \right\}^2 \tag{9.69}$$

where the elements $R_{ni} \in \{0, 1\}$ of the dropout matrix are chosen randomly from
a Bernoulli distribution with parameter $\rho$. We now take an expectation over the
distribution of random dropout parameters. Show that

$$\mathbb{E}[R_{ni}] = \rho \tag{9.70}$$

$$\mathbb{E}[R_{ni} R_{nj}] = \delta_{ij} \rho + (1 - \delta_{ij}) \rho^2. \tag{9.71}$$

Hence, show that the expected error function for this dropout model is given by

$$\mathbb{E}[E(\mathbf{W})] = \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ y_{nk} - \rho \sum_{i=1}^{D} w_{ki} x_{ni} \right\}^2 \tag{9.72}$$

$$\qquad\qquad + \rho (1 - \rho) \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{D} w_{ki}^2 x_{ni}^2. \tag{9.73}$$

Thus, we see that the expected error function corresponds to a sum-of-squares error
with a quadratic regularizer in which the regularization coefficient is scaled
separately for each input variable according to the data values seen by that input. Finally,
write down a closed-form solution for the weight matrix that minimizes this
regularized error function.


10. Convolutional Networks


The simplest machine learning models assume that the observed data values are
unstructured, meaning that the elements of the data vectors $\mathbf{x} = (x_1, \ldots, x_D)$ are
treated as if we do not know anything in advance about how the individual elements 
might relate to each other. If we were to make a random permutation of the ordering 
of these variables and apply this fixed permutation consistently on all training and 
test data, there would be no difference in the performance for the models considered 
so far. 

Many applications of machine learning, however, involve structured data in
which there are additional relationships between input variables. For example, the
words in natural language form a sequence, and if we were to model language as a
generative autoregressive process (Chapter 12) then we would expect each word to depend more
strongly on the immediately preceding words and less so on words much earlier in
the sequence. Likewise, the pixels of an image have a well-defined spatial relationship
to each other in which the input variables are arranged in a two-dimensional
grid, and nearby pixels have highly correlated values.

We have already seen that our knowledge of the structure of specific data modal- 
ities can be utilized through the addition of a regularization term to the error function 
in the training objective, through data augmentation, or through modifications to the 
model architecture. These approaches can help guide the model to respect certain 
properties such as invariance and equivariance with respect to transformations of the 
input data. In this chapter we will take a look at an architectural approach called a 
convolutional neural network (CNN), which we will see can be viewed as a sparsely 
connected multilayer network with parameter sharing, and designed to encode in- 
variances and equivariances specific to image data. 


10.1. Computer Vision


The automatic analysis and interpretation of image data form the focus of the field 
of computer vision and represent a major application area for machine learning 
(Szeliski, 2022). Historically, computer vision was based largely on 3-dimensional 
projective geometry. Hand-crafted features were constructed and used as input to 
simple learning algorithms (Hartley and Zisserman, 2004). However, it was one 
of the first fields to be transformed by the deep learning revolution, predominantly 
thanks to the CNN architecture. Although the architecture was originally developed 
in the context of image analysis, it has also been applied in other domains such as the 
analysis of sequential data. Recently, alternative architectures based on transformers
(Chapter 12) have become competitive with convolutional networks in some applications.

There are many applications for machine learning in computer vision, of which 
some of the most commonly encountered are the following: 


1. Classification of images, for example classifying an image of a skin lesion as
benign or malignant (Figure 1.1). This is sometimes called ‘image recognition’.


2. Detection of objects in an image and determining their locations within the
image, for example detecting pedestrians from camera data collected by an
autonomous vehicle (Figure 10.19).


3. Segmentation of images, in which each pixel is classified individually thereby
dividing the image into regions sharing a common label (Figure 10.26). For example, a
natural scene might be segmented into sky, grass, trees, and buildings, whereas
a medical scan image could be segmented into cancerous tissue and normal
tissue.


4. Caption generation in which a textual description is generated automatically
from an image (Figure 12.27).


5. Synthesis of new images, for example generating images of human faces (Figure 1.3).
Images can also be synthesized based on a text input describing the desired image
content.


6. Inpainting in which a region of an image is replaced with synthesized pixels
that are consistent with the rest of the image (Figure 20.9). This is used, for example, to
remove unwanted objects during image editing.


7. Style transfer in which an input image in one style, for example a photograph,
is transformed into a corresponding image in a different style, for example an
oil painting (Figure 10.32).


8. Super-resolution in which the resolution of an image is improved by increasing
the number of pixels and synthesizing associated high-frequency information
(Figure 20.8).


9. Depth prediction in which one or more views are used to predict the distance
of the scene from the camera at each pixel in a target image (Figure 6.8).


10. Scene reconstruction in which one or more two-dimensional images of a 
scene are used to reconstruct a three-dimensional representation. 


10.1.1 Image data 


An image comprises a rectangular array of pixels, in which each pixel has either 
a grey-scale intensity or more commonly a triplet of red, green, and blue channels 
each with its own intensity value. These intensities are non-negative numbers that 
also have some maximum value corresponding to the limits of the camera or other 
hardware device used to capture the image. For the most part, we will view the
intensities as continuous variables, but in practice they are represented with finite
precision, for example as 8-bit numbers represented as integers in the range $0, \ldots, 255$.
Some images, such as the magnetic resonance imaging (MRI) scans used in medical 
diagnosis, comprise three-dimensional grids of voxels. Similarly, videos comprise 
a sequence of two-dimensional images and therefore can also be viewed as three- 
dimensional structures in which successive frames are stacked through time. 

Now consider the challenge of applying neural networks to image data to ad- 
dress some of the applications highlighted above. Images generally have a high 
dimensionality, with typical cameras capturing images comprising tens of megapix- 
els. Treating the image data as unstructured may therefore require a model with a 
vast number of parameters that would be infeasible to train. More significantly, such 
an approach fails to take account of the highly structured nature of image data, in 
which the relative positions of different pixels play a crucial role. We can see this 
because if we take the pixels of an image and randomly permute them, then the result 
no longer looks like a natural image. Similarly, if we generate a synthetic image by 
drawing random values for the pixel intensities independently for each pixel, there 
is essentially zero chance of generating something that looks like a natural image. 
Local correlations are important, and in a natural image there is a much higher prob- 
ability that two nearby pixels will have similar colours and intensities compared to 
two pixels that are far apart. This represents powerful prior knowledge and can be 
used to encode strong inductive biases into a neural network, leading to models with 
far fewer parameters and with much better generalization accuracy. 




10.2. Convolutional Filters 


One motivation for the introduction of convolutional networks is that for image data, 
which is the modality for which CNNs were designed, a standard fully connected 
architecture would require vast numbers of parameters due to the high-dimensional 
nature of images. To see this, consider a colour image with $10^3 \times 10^3$ pixels, each
with three values corresponding to red, green, and blue intensities. If the first hidden
layer of the network has, say, 1,000 hidden units, then we already have $3 \times 10^9$
weights in the first layer. Furthermore, such a network would have to learn any 
invariances and equivariances by example, which would require huge data sets. By 
designing an architecture that incorporates our inductive bias about the structure 
of images, we can reduce the data set requirements dramatically and also improve 
generalization with respect to symmetries in the image space. 
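
The weight count quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `dense_first_layer_weights` is an illustrative choice rather than anything defined in the text, and bias parameters are ignored.

```python
def dense_first_layer_weights(height, width, channels, hidden_units):
    """Weights needed when every input value connects to every hidden unit."""
    n_inputs = height * width * channels
    return n_inputs * hidden_units


# A 1,000 x 1,000 colour image feeding 1,000 fully connected hidden units.
n_weights = dense_first_layer_weights(1000, 1000, 3, 1000)
print(f"{n_weights:.1e}")  # 3.0e+09
```

Every additional hidden unit adds another three million weights, which is why a fully connected first layer scales so poorly with image resolution.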

To exploit the two-dimensional structure of image data to create inductive bi- 
ases, we can use four interrelated concepts: hierarchy, locality, equivariance, and 
invariance. Consider the task of detecting faces in images. There is a natural hier- 
archical structure because one image may contain several faces, and each face in- 
cludes elements such as eyes, and each eye has structure such as an iris, which itself 
has structure such as edges. At the lowest level of the hierarchy, a node in a neural 
network could detect the presence of a feature such as an edge using information 
that is local to a small region of an image, and therefore it would only need to see 
a small subset of the image pixels. More complex structures further up the hierar- 
chy can be detected by composing multiple features found at previous levels. A key 
point, however, is that although we want to build the general concept of hierarchy 
into the model, we want the details of the hierarchy, including the type of features 
extracted at each level, to be learned from data, not hand-coded. Hierarchical models 
fit naturally within the deep learning framework, which already allows very complex 
concepts to be extracted from raw data through a succession of, possibly very many, 
‘layers’ of processing, in which the whole system is trained end-to-end. 


10.2.1 Feature detectors 


For simplicity we will initially restrict our attention to grey-scale images (i.e., 
ones having a single channel). Consider a single unit in the first layer of a neural 
network that takes as input just the pixel values from a small rectangular region, 
or patch, from the image, as illustrated in Figure 10.1(a). This patch is referred to 
as the receptive field of that unit, and it captures the notion of locality. We would 
like weight values associated with this unit to learn to detect some useful low-level 
feature. The output of this unit is given by the usual functional form comprising a 
weighted linear combination of the input values, which is subsequently transformed 
using a nonlinear activation function: 


$$z = \mathrm{ReLU}(\mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0) \tag{10.1}$$


where $\mathbf{x}$ is a vector of pixel values for the receptive field, and we have assumed a
ReLU activation function. Because there is one weight associated with each input
pixel, the weights themselves form a small two-dimensional grid known as a filter,
sometimes also called a kernel, which itself can be visualized as an image. This is
illustrated in Figure 10.1(b).

Figure 10.1 (a) Illustration of a receptive field, showing a unit in a hidden layer of a network
that receives input from pixels in a $3 \times 3$ patch of the image. Pixels in this patch form
the receptive field for this unit. (b) The weight values associated with this hidden unit can be
visualized as a small $3 \times 3$ matrix, known as a kernel. There is also an additional bias
parameter that is not shown here.

Suppose that $\mathbf{w}$ and $w_0$ in (10.1) are fixed and we ask for which value of the
input image patch $\mathbf{x}$ this hidden unit will give the largest output response. To answer
this we need to constrain $\mathbf{x}$ in some way, so let us suppose that its norm $\|\mathbf{x}\|^2$ is fixed.
Then the solution for $\mathbf{x}$ that maximizes $\mathbf{w}^{\mathrm{T}} \mathbf{x}$, and hence maximizes the response of
the hidden unit, is of the form $\mathbf{x} = \alpha \mathbf{w}$ for some coefficient $\alpha$ (Exercise 10.1). This says that
the maximum output response from this hidden unit occurs when it detects a patch
of image that, up to an overall scaling, looks like the kernel image. Note that the
ReLU generates a non-zero output only when $\mathbf{w}^{\mathrm{T}} \mathbf{x}$ exceeds a threshold of $-w_0$, and
therefore the unit acts as a feature detector that signals when it finds a sufficiently
good match to its kernel.
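
The behaviour of such a unit can be illustrated numerically. The sketch below evaluates (10.1) for three patches that all have the same norm; the kernel values, the bias, and the helper names `unit_response` and `rescale_to_norm` are hypothetical choices made purely for illustration.

```python
import math


def unit_response(w, x, w0):
    """Evaluate ReLU(w^T x + w0) for a flattened kernel w and patch x."""
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return max(0.0, pre_activation)


def rescale_to_norm(x, norm):
    """Rescale x so that its Euclidean norm equals `norm`."""
    length = math.sqrt(sum(xi * xi for xi in x))
    return [norm * xi / length for xi in x]


# Hypothetical 3 x 3 kernel (flattened row by row) and bias.
w = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
w0 = -0.5

matched = rescale_to_norm(w, 1.0)          # x = alpha * w with alpha > 0
uniform = rescale_to_norm([1.0] * 9, 1.0)  # a flat patch with the same norm
flipped = [-xi for xi in matched]          # x = alpha * w with alpha < 0

for name, patch in [("matched", matched), ("uniform", uniform), ("flipped", flipped)]:
    print(name, unit_response(w, patch, w0))
```

Only the patch proportional to the kernel with a positive coefficient clears the threshold $-w_0$ in this example; the other two patches give zero output.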


10.2.2 Translation equivariance 


Next note that if a small patch in a face image represents, for example, an eye 
at that location, then the same set of pixel values in a different part of the image 
must represent an eye at the new location. Our neural network needs to be able to 
generalize what it has learned in one location to all possible locations in the image, 
without needing to see examples in the training set of the corresponding feature at 
every possible location. To achieve this, we can simply replicate the same hidden- 
unit weight values at multiple locations across the image, as illustrated for a one- 
dimensional input space in Figure 10.2. 

The units of the hidden layer form a feature map in which all the units share
the same weights. Consequently if a local patch of an image produces a particular
response in the unit connected to that patch, then the same set of pixel values at
a different location will produce the same response in the corresponding translated
location in the feature map. This is an example of equivariance. We see that the
connections in this network are sparse in that most connections are absent. Also, the
values of the weights are shared by all the hidden units, as indicated by the colours
of the connections. This transformation is an example of a convolution.

Figure 10.2 Illustration of convolution for a one-dimensional array of input values and a
kernel of width 2. The connections are sparse and are shared by the hidden units, as shown
by the red and blue arrows in which links with the same colour have the same weight values.
This network therefore has six connections but only two independent learnable parameters.
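
Weight sharing and translation equivariance can be seen in a few lines of code. The following is a minimal sketch of a one-dimensional convolution with a kernel of width 2; the function name `conv1d`, the kernel values, and the input signal are assumptions chosen for illustration, and a five-element input is used so that there is room to shift the pattern.

```python
def conv1d(inputs, kernel):
    """Valid 1-D convolution in which every output unit reuses the same kernel."""
    width = len(kernel)
    return [
        sum(kernel[m] * inputs[j + m] for m in range(width))
        for j in range(len(inputs) - width + 1)
    ]


kernel = [2.0, -1.0]                   # the only two learnable weights (hypothetical values)
signal = [0.0, 1.0, 3.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]          # the same pattern moved one position to the right

out = conv1d(signal, kernel)           # [-1.0, -1.0, 6.0, 0.0]
out_shifted = conv1d(shifted, kernel)  # [0.0, -1.0, -1.0, 6.0]
print(out, out_shifted)
```

Shifting the input by one position shifts the feature map by one position, apart from values that enter or leave at the boundary.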

We can extend the idea of convolution to two-dimensional images as follows
(Dumoulin and Visin, 2016). For an image $\mathbf{I}$ with pixel intensities $I(j, k)$, and a
filter $\mathbf{K}$ with pixel values $K(l, m)$, the feature map $\mathbf{C}$ has activation values given by

$$C(j, k) = \sum_{l} \sum_{m} I(j + l, k + m) K(l, m) \tag{10.2}$$

where we have omitted the nonlinear activation function for clarity. This again is
an example of a convolution and is sometimes expressed as $\mathbf{C} = \mathbf{I} * \mathbf{K}$. Note that
strictly speaking (10.2) is called a cross-correlation, which differs slightly from the
conventional mathematical definition of convolution (Exercise 10.4), but here we will follow
common practice in the machine learning literature and refer to (10.2) as a convolution.
The relationship (10.2) is illustrated in Figure 10.3 for a $3 \times 3$ image and a $2 \times 2$
filter. Importantly, when using batch normalization (Section 7.4.2) in a convolutional network, the
same value of mean and variance must be used at every spatial location within a
feature map when normalizing the states of the units to ensure that the statistics of the
feature map are independent of location.
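
A direct, unoptimized implementation of (10.2) helps to make the index conventions concrete. This is a minimal sketch using nested lists; the names `cross_correlate2d` and `flip`, and the image and kernel values, are illustrative assumptions and are not the values shown in Figure 10.3.

```python
def cross_correlate2d(image, kernel):
    """Evaluate C(j, k) = sum_l sum_m I(j + l, k + m) K(l, m) with no padding."""
    J, K = len(image), len(image[0])
    L, M = len(kernel), len(kernel[0])
    return [
        [
            sum(image[j + l][k + m] * kernel[l][m] for l in range(L) for m in range(M))
            for k in range(K - M + 1)
        ]
        for j in range(J - L + 1)
    ]


def flip(kernel):
    """Reverse a kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))        # [[-4, -4], [-4, -4]]
print(cross_correlate2d(image, flip(kernel)))  # [[4, 4], [4, 4]]
```

The second call applies the flipped kernel, which is what the conventional mathematical definition of convolution would compute. Because the kernel values are learned, the distinction makes no practical difference to a trained network.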

As an example of the application of convolution, we consider the problem of 
detecting edges in images using a fixed, hand-crafted convolutional filter. Intuitively, 
we can think of a vertical edge as occurring when there is a significant local change 
in the intensity between pixels as we move horizontally across the image. We can 
measure this by convolving the image with a $3 \times 3$ filter of the form


(10.3)


Figure 10.3 Example of a $3 \times 3$ image convolved with a $2 \times 2$ filter to give a resulting $2 \times 2$ feature map.


Similarly we can detect horizontal edges by convolving with the transpose of this 
filter: 


(10.4) 


Figure 10.4 shows the results of applying these two convolutional filters to a sample 
image. Note that in Figure 10.4(b) if a vertical edge corresponds to an increase in 
pixel intensity, the corresponding point on the feature map is positive (indicated by a 
light colour), whereas if the vertical edge corresponds to a decrease in pixel intensity, 
the corresponding point on the feature map is negative (indicated by a dark colour), 
with analogous properties for Figure 10.4(c) for horizontal edges. 
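
The sign behaviour described above can be reproduced on a small synthetic image. The sketch below assumes a simple filter in which every row is $(-1, 0, 1)$; this choice is an assumption made for illustration and may differ from the filter in (10.3). The helper `cross_correlate2d` follows (10.2), and all names are hypothetical.

```python
def cross_correlate2d(image, kernel):
    """Evaluate (10.2) with no padding and a stride of 1."""
    J, K = len(image), len(image[0])
    L, M = len(kernel), len(kernel[0])
    return [
        [
            sum(image[j + l][k + m] * kernel[l][m] for l in range(L) for m in range(M))
            for k in range(K - M + 1)
        ]
        for j in range(J - L + 1)
    ]


# Assumed filter: responds to intensity changes along the horizontal direction.
vertical_edge_filter = [[-1, 0, 1] for _ in range(3)]
# Its transpose responds to intensity changes along the vertical direction.
horizontal_edge_filter = [list(column) for column in zip(*vertical_edge_filter)]

dark_to_bright = [[0, 0, 0, 1, 1] for _ in range(5)]  # intensity increases left to right
bright_to_dark = [[1, 1, 0, 0, 0] for _ in range(5)]  # intensity decreases left to right

print(cross_correlate2d(dark_to_bright, vertical_edge_filter))    # positive at the edge
print(cross_correlate2d(bright_to_dark, vertical_edge_filter))    # negative at the edge
print(cross_correlate2d(dark_to_bright, horizontal_edge_filter))  # zero everywhere
```

An increase in intensity gives a positive response, a decrease gives a negative response, and the transposed filter ignores a purely vertical edge.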

Figure 10.4 Illustration of edge detection using convolutional filters showing (a) the original image, (b) the result
of convolving with the filter (10.3) that detects vertical edges, and (c) the result of convolving with the filter (10.4)
that detects horizontal edges.

Comparing this convolutional structure with a standard fully connected network,
we see several advantages: (i) the connections are sparse, leading to far fewer
weights even with large images, (ii) the weight values are shared, greatly reducing
the number of independent parameters and consequently reducing the required size
of the training set needed to learn those parameters, and (iii) the same network can
be applied to images of different sizes without the need for retraining. We will
return to this final point later, but for the moment, note that changing the size
of the input image changes only the size of the feature map and does not change
the number of weights, or the number of independent learnable parameters, in the
model. One final observation regarding convolutional networks is that, by exploiting
the massive parallelism of graphics processing units (GPUs) to achieve high
computational throughput, convolutions can be implemented very efficiently.


10.2.3 Padding 


We see from Figure 10.3 that the convolution map is smaller than the original
image. If the image has dimensionality $J \times K$ pixels and we convolve with a kernel
of dimensionality $M \times M$ (filters are typically chosen to be square) the resulting
feature map has dimensionality $(J - M + 1) \times (K - M + 1)$. In some cases
we want the feature map to have the same dimensions as the original image. This
can be achieved by padding the original image with additional pixels around the
outside, as illustrated in Figure 10.5. If we pad with $P$ pixels on each side then the output map has
dimensionality $(J + 2P - M + 1) \times (K + 2P - M + 1)$. If there is no padding, so that
$P = 0$, this is called a valid convolution. When the value of $P$ is chosen such that the
output array has the same size as the input, corresponding to $P = (M - 1)/2$ (Exercise 10.6), this
is called a same convolution, because the image and the feature map have the same
dimensions. In computer vision, filters generally use odd values of $M$, so that the
padding can be symmetric on all sides of the image and so that there is a well-defined
central pixel associated with the location of the filter. Finally, we have to choose a
suitable value for the intensities associated with the padding pixels. A typical choice
is to set the padding values to zero, after first subtracting the mean from each image
so that zero represents the average value of the pixel intensity. Padding can also be
applied to feature maps for processing by subsequent convolutional layers.
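
The effect of padding on the feature-map size can be sketched as follows. The helper names `zero_pad`, `output_size`, and `same_padding` are illustrative assumptions, and the image is represented as a list of rows.

```python
def zero_pad(image, P):
    """Surround a J x K image with P rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * P
    border = [[0] * width for _ in range(P)]
    middle = [[0] * P + list(row) + [0] * P for row in image]
    return border + middle + [list(row) for row in border]


def output_size(J, K, M, P=0):
    """Feature-map size for an M x M filter, padding P, and a stride of 1."""
    return (J + 2 * P - M + 1, K + 2 * P - M + 1)


def same_padding(M):
    """Padding that preserves the image size; requires an odd filter size."""
    if M % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (M - 1) // 2


image = [[1] * 4 for _ in range(4)]            # a 4 x 4 image
padded = zero_pad(image, 1)                    # becomes 6 x 6
print(len(padded), len(padded[0]))             # 6 6
print(output_size(4, 4, 3))                    # valid convolution: (2, 2)
print(output_size(4, 4, 3, same_padding(3)))   # same convolution: (4, 4)
```

Here the padding value is simply zero; subtracting the image mean beforehand, as described above, is left out of the sketch.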


10.2.4 Strided convolutions 


In typical image processing applications, the images can have very large numbers
of pixels, and since the kernels are often relatively small, so that $M \ll J, K$, the
convolutional feature map will be of a similar size to the original image and will be
the same size if same padding is used. Sometimes we wish to use feature maps that
are significantly smaller than the original image to provide flexibility in the design
of convolutional network architectures. One way to achieve this is to use strided
convolutions in which, instead of stepping the filter over the image one pixel at a
time, it is moved in larger steps of size $S$, called the stride. If we use the same stride
horizontally and vertically, then the number of elements in the feature map will be

$$\left\lfloor \frac{J + 2P - M}{S} + 1 \right\rfloor \times \left\lfloor \frac{K + 2P - M}{S} + 1 \right\rfloor \tag{10.5}$$

where $\lfloor x \rfloor$ denotes the ‘floor’ of $x$, i.e., the largest integer that is less than or equal
to $x$ (Exercise 10.7). For large images and small filter sizes, each dimension of the feature map
will be roughly a factor of $S$ smaller than the corresponding dimension of the original image.

Figure 10.5 Illustration of a $4 \times 4$ image that has been padded
with additional pixels to create a $6 \times 6$ image.
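
The size formula for strided convolutions can be checked against a brute-force count of the filter positions. The following sketch assumes the same stride and padding in both directions; the function names `strided_output_size` and `count_positions` are hypothetical.

```python
def strided_output_size(J, K, M, P=0, S=1):
    """Feature-map size for an M x M filter with padding P and stride S."""
    return ((J + 2 * P - M) // S + 1, (K + 2 * P - M) // S + 1)


def count_positions(length, M, P, S):
    """Count the filter placements along one axis by stepping explicitly."""
    count, start = 0, 0
    while start + M <= length + 2 * P:
        count += 1
        start += S
    return count


print(strided_output_size(224, 224, 3, P=1, S=2))  # (112, 112)
print(count_positions(224, 3, 1, 2))               # 112
```

With a stride of 2 each side of the feature map is halved, so the number of elements falls by roughly a factor of four.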


10.2.5 Multi-dimensional convolutions 


So far we have considered convolutions over a single grey-scale image. For a
colour image there will be three channels corresponding to the red, green, and blue
colours. We can easily extend convolutions to cover multiple channels by extending
the dimensionality of the filter. An image with $J \times K$ pixels and $C$ channels will
be described by a tensor (Section 6.3.7) of dimensionality $J \times K \times C$. We can introduce a filter
described by a tensor of dimensionality $M \times M \times C$ comprising a separate $M \times M$
filter for each of the $C$ channels. Assuming no padding and a stride of 1, this again
gives a feature map of size $(J - M + 1) \times (K - M + 1)$, as is illustrated in Figure 10.6.

Figure 10.6 (a) Illustration of a multi-dimensional filter that takes input from across the R, G, and B channels.
(b) The kernel here has 27 weights (plus a bias parameter not shown) and can be visualized as a $3 \times 3 \times 3$
tensor.

We now make a further important extension to convolutions. Up to now we
have created a single feature map in which all the points in the feature map share the
same set of learnable parameters. For a filter of dimensionality $M \times M \times C$, this
will have $M^2 C$ weight parameters, irrespective of the size of the image. In addition
there will be a bias parameter associated with this unit. Such a filter is analogous
to a single hidden node in a fully connected network, and it can learn to detect only
one kind of feature and is therefore very limited. To build more flexible models,
we simply include multiple such filters, in which each filter has its own independent
set of parameters giving rise to its own independent feature map, as illustrated in
Figure 10.7. We will again refer to these separate feature maps as channels. The
filter tensor now has dimensionality $M \times M \times C \times C_{\mathrm{OUT}}$ where $C$ is the number
of input channels and $C_{\mathrm{OUT}}$ is the number of output channels. Each output channel
will have its own associated bias parameter, so the total number of parameters will
be $(M^2 C + 1) C_{\mathrm{OUT}}$.

Figure 10.7 The multi-dimensional convolutional filter layer shown in Figure 10.6 can be
extended to include multiple independent filter channels.
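
The parameter count is easily evaluated for specific layer shapes. The function name `conv_layer_parameters` below is an illustrative choice, and the layer sizes are hypothetical examples.

```python
def conv_layer_parameters(M, C_in, C_out):
    """Learnable parameters in a layer of C_out filters, each M x M x C_in plus a bias."""
    return (M * M * C_in + 1) * C_out


print(conv_layer_parameters(3, 3, 1))    # 28: the 27 weights of Figure 10.6 plus one bias
print(conv_layer_parameters(3, 3, 64))   # 1792
print(conv_layer_parameters(3, 64, 64))  # 36928
```

None of these counts depends on the image size, in contrast to the fully connected case.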

A useful concept in designing convolutional networks is the $1 \times 1$ convolution
(Lin, Chen, and Yan, 2013), which is simply a convolutional layer in which the filter
size is a single pixel. The filters have $C$ weights, one for each input channel, plus
a bias. One application for $1 \times 1$ convolutions is simply to change the number of
channels (typically to reduce the number of channels) without changing the size of 
the feature maps, by setting the number of output channels to be different to the 
number of input channels. It is therefore complementary to strided convolutions or 
pooling in that it reduces the number of channels rather than the dimensionality of 
the channels. 
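
A $1 \times 1$ convolution amounts to applying the same small linear map to the channel values at every pixel. The sketch below stores a feature map as nested lists indexed by row, column, and channel; the function name `conv1x1` and all numerical values are assumptions for illustration, and the nonlinear activation is omitted.

```python
def conv1x1(feature_map, weights, biases):
    """Apply a 1 x 1 convolution.

    feature_map[j][k] holds the C input channel values at pixel (j, k);
    weights[o] holds the C weights of output channel o.
    """
    return [
        [
            [
                sum(w * v for w, v in zip(weights[o], pixel)) + biases[o]
                for o in range(len(weights))
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


# A 2 x 2 feature map with three channels, reduced to two channels.
feature_map = [[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [0, 0, 0]]]
weights = [[1, 1, 1],
           [1, 0, -1]]
biases = [0, 0]

print(conv1x1(feature_map, weights, biases))
# [[[6, -2], [15, -2]], [[24, -2], [0, 0]]]
```

The spatial size stays at $2 \times 2$ while the number of channels drops from three to two.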


10.2.6 Pooling 


A convolutional layer encodes translation equivariance, whereby if a small patch
of pixels, representing the receptive field of a hidden unit, is moved to a different
location in the image, the associated outputs of the feature map will move to the
corresponding location in the feature map. This is valuable for applications such as
finding the location of an object within an image. For other applications, such as
classifying an image, we want the output to be invariant to translations of the input.
In all cases, however, we want the network to be able to learn hierarchical structure
in which complex features at a particular level are built up from simpler features
at the previous level. In many cases the spatial relationship between those simpler
features will be important. For example, it is the relative positions of the eyes, nose,
and mouth that help determine the presence of a face and not just the presence of
these features in arbitrary locations within the image. However, small changes in the
relative locations do not affect the classification, and we want to be invariant to such
small translations of individual features. This can be achieved using pooling applied
to the output of the convolutional layer.

Figure 10.8 Illustration of max-pooling in which blocks of $2 \times 2$ pixels in a feature map
are combined using the ‘max’ operator to generate a new feature map of smaller dimensionality.

Pooling has similarities to using a convolutional layer in that an array of units is 
arranged in a grid, with each unit taking input from a receptive field in the previous 
feature map layer. Again, there is a choice of filter size and of stride length. The 
difference, however, is that the output of a pooling unit is a simple, fixed function of 
its inputs, and so there are no learnable parameters in pooling. A common example 
of a pooling function is max-pooling (Zhou and Chellappa, 1988) in which each unit 
simply outputs the max function applied to the input values. This is illustrated with 
a simple example in Figure 10.8. Here the stride length is equal to the filter width, 
and so there is no overlap of the receptive fields. 

As well as building in some local translation invariance, pooling can also be 
used to reduce the dimensionality of the representation by down-sampling the feature 
map. Note that using strides greater than 1 in a convolutional layer also has the effect 
of down-sampling the feature maps. 

We can interpret the activation of a unit in a feature map as a measure of the 
strength of detection of a corresponding feature, so that the max-pooling preserves 
information on whether the feature is present and with what strength but discards 
some positional information. There are many other choices of pooling function, for 
example average pooling in which the pooling function computes the average of the 
values from the corresponding receptive field in the feature map. These all introduce 
some degree of local translation invariance. 

Pooling is usually applied to each channel of a feature map independently. For
example, if we have a feature map with 8 channels, each of dimensionality $64 \times 64$,
and we apply max-pooling with a receptive field of size $2 \times 2$ and a stride of 2, the
output of the pooling operation will be a tensor of dimensionality $32 \times 32 \times 8$.
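
Max-pooling applied channel by channel can be sketched as follows. The function name `max_pool2d` and the input values are illustrative assumptions; each channel is represented as a list of rows.

```python
def max_pool2d(channel, size=2, stride=2):
    """Max-pool one channel using size x size receptive fields and the given stride."""
    J, K = len(channel), len(channel[0])
    return [
        [
            max(channel[j + a][k + b] for a in range(size) for b in range(size))
            for k in range(0, K - size + 1, stride)
        ]
        for j in range(0, J - size + 1, stride)
    ]


channel = [[1, 3, 2, 0],
           [4, 2, 1, 1],
           [0, 0, 5, 6],
           [1, 2, 7, 8]]
print(max_pool2d(channel))  # [[4, 2], [2, 8]]

# Eight channels of size 64 x 64, pooled independently.
feature_map = [[[0] * 64 for _ in range(64)] for _ in range(8)]
pooled = [max_pool2d(c) for c in feature_map]
print(len(pooled[0]), len(pooled[0][0]), len(pooled))  # 32 32 8
```

There are no learnable parameters anywhere in this operation, only the choice of receptive-field size and stride.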

We can also apply pooling across multiple channels of a feature map, which 
gives the network the potential to learn other invariances beyond simple translation 
invariance. For example, if several channels in a convolutional layer learn to detect 
the same feature but at different orientations, then max-pooling across those feature 
maps will be approximately invariant to rotations. 

Pooling also allows a convolutional network to process images of varying sizes. 
Ultimately, the output, and generally some of the intermediate layers, of a convolu- 
tional network must have a fixed size. Variable-sized inputs can be accommodated 
by varying the stride length of the pooling according to the size of the image such 
that the number of pooled outputs remains constant. 
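
One way to realize this idea is sketched below: the input is divided into a fixed grid of pooling regions whose size scales with the image, so that the output shape does not depend on the input shape. This is an illustrative variant, not necessarily the scheme used in any particular network, and the name `adaptive_max_pool2d` is hypothetical.

```python
import numpy as np

def adaptive_max_pool2d(x, out_h, out_w):
    """Max-pool an (H, W, C) array onto a fixed (out_h, out_w, C) grid."""
    h, w, c = x.shape
    if h < out_h or w < out_w:
        raise ValueError("input must be at least as large as the output grid")
    # Integer bin edges that tile the input; every bin is non-empty.
    rows = [(i * h) // out_h for i in range(out_h + 1)]
    cols = [(j * w) // out_w for j in range(out_w + 1)]
    out = np.empty((out_h, out_w, c), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            region = x[rows[i]:rows[i + 1], cols[j]:cols[j + 1], :]
            out[i, j, :] = region.max(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x_small = rng.standard_normal((28, 28, 4))
x_large = rng.standard_normal((45, 61, 4))
small = adaptive_max_pool2d(x_small, 7, 7)
large = adaptive_max_pool2d(x_large, 7, 7)
print(small.shape, large.shape)  # (7, 7, 4) (7, 7, 4)
```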


10.2.7 Multilayer convolutions 


The convolutional network structure described so far is analogous to a single 
layer in a standard fully connected neural network. To allow the network to discover 
and represent hierarchical structure in the data, we now extend the architecture by 
considering multiple layers of the kind described above. Each convolutional layer is 
described by a filter tensor of dimensionality $M \times M \times C_{\text{IN}} \times C_{\text{OUT}}$ in which the 
number of independent weight and bias parameters is $(M^2 C_{\text{IN}} + 1) C_{\text{OUT}}$. Each such 
convolutional layer can optionally be followed by a pooling layer. We can now apply 
multiple such layers of convolution and pooling in succession, in which the $C_{\text{OUT}}$ 
output channels of a particular layer, analogous to the RGB channels of the input 
image, form the input channels of the next layer. Note that the number of channels 
in a feature map is sometimes called the ‘depth’ of the feature map, but we prefer to 
reserve the term depth to mean the number of layers in a multilayer network. 

A key property that we built into the convolutional framework is that of locality, 
in which a given unit in a feature map takes information only from a small patch, the 
receptive field, in the previous layer. When we construct a deep neural network in 
which each layer is convolutional then the effective receptive field of a unit in later 
layers in the network becomes much larger than those in earlier layers, as seen in 
Figure 10.9. 
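
The growth of the effective receptive field can be computed with a simple recurrence, sketched here in one dimension. The function name and the representation of a network as a list of (filter width, stride) pairs are assumptions for illustration.

```python
def effective_receptive_field(layers):
    """Width, in input units, of the receptive field of one unit after `layers`.

    `layers` is a list of (filter_width, stride) pairs ordered from input to output.
    """
    field = 1  # a unit in the input layer sees only itself
    jump = 1   # spacing, in input units, between adjacent units of the current layer
    for width, stride in layers:
        field += (width - 1) * jump
        jump *= stride
    return field

print(effective_receptive_field([(3, 1)]))          # 3
print(effective_receptive_field([(3, 1), (3, 1)]))  # 5
```

With two layers of width-3 filters and unit stride, this gives the 3 middle-layer units and 5 input units described for Figure 10.9.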

In many applications, the output units of the network need to make predictions 
about the image as a whole, for example in a classification task, and so they need 
to combine information from across the whole of the input image. This is typically 
achieved by introducing one or two standard fully connected layers as the final stages 
of the network, in which each unit is connected to every unit in the previous layer. 
The number of parameters in such an architecture can be manageable because the fi- 
nal convolutional layer generally has much lower dimensionality than the input layer 
due to the intermediate pooling layers. Nevertheless, the final fully connected layers 
may contain the majority of the independent degrees of freedom in the network even 
if the number of (shared) connections in the network is larger in the convolutional 
layers. 

Figure 10.9 Illustration of how the effective receptive field grows with depth in 
a multilayer convolutional network. Here we see that the red unit at the top of the 
output layer takes inputs from a receptive field in the middle layer of units, each of 
which has a receptive field in the first layer of units. Thus, the activation of the red 
unit in the output layer depends on the outputs of 3 units in the middle layer and 5 
units in the input layer. 

A complete CNN therefore comprises multiple layers of convolutions interspersed 
with pooling operations, and often with conventional fully connected layers 
in the final stages of the network. There are many choices to be made in designing 
such an architecture including the number of layers, the number of channels in each 
layer, the filter sizes, the stride widths, and multiple other such hyperparameters. A 
wide variety of different architectures have been explored, although in practice it is 
difficult to make a systematic comparison of hyperparameter values using hold-out 
data due to the high computational cost of training each candidate configuration. 


10.2.8 Example network architectures 


Convolutional networks were the first deep neural networks (i.e., ones with 
more than two learnable layers of parameters) to be successfully deployed in ap- 
plications. An early example was LeNet, which was used to classify low-resolution 
monochrome images of handwritten digits (LeCun et al., 1989; LeCun et al., 1998). 
The development of more powerful convolutional networks was accelerated through 
the introduction of a large-scale benchmark data set called ImageNet (Deng et al., 
2009) comprising some 14 million natural images each of which has been hand la- 
belled into one of nearly 22,000 categories. This was a much larger data set than had 
been used previously, and the advances in the field driven by ImageNet served to em- 
phasize the importance of large-scale data, alongside well-designed models having 
appropriate inductive biases, in building successful deep learning solutions. 

A subset of images comprising 1,000 non-overlapping categories formed the ba- 
sis for the annual ImageNet Large Scale Visual Recognition Challenge. Again, this 
was a much larger number of categories than the typically few dozen classes 
previously considered. Having so many categories made the problem much more 
challenging because, if the classes were distributed uniformly, random guessing would 
have an error rate of 99.9%. The data set has just over 1.28 million training images, 
50,000 validation images, and 100,000 test images. The classifiers are designed to 
produce a ranked list of predicted output classes on test images, and results are 
reported in terms of top-1 and top-5 error rates, meaning an image is deemed to be 
correctly classified if the true class appears at the top of the list or if it is in one of the 
five highest-ranked class predictions, respectively. Early results with this data set 
achieved a top-5 error rate of around 25.5%. An important advance was made by the 
AlexNet convolutional network architecture (Krizhevsky, Sutskever, and Hinton, 2012), 
which won the 2012 competition and reduced the top-5 error rate to a new record of 15.3%. Key 
aspects of this model were the use of the ReLU activation function, the application of 
GPUs to train the network, and the use of dropout regularization. Subsequent years 
saw further advances, leading to error rates of around 3%, which is somewhat better 
than human-level performance for the same data, which is around 5% (Dodge and 
Karam, 2017). This can be attributed to the difficulty humans have in distinguishing 
subtly different classes (for example multiple varieties of mushrooms). 
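
The top-1 and top-5 error rates described above are both instances of a top-$k$ error, which can be sketched as follows. The function name `top_k_error` and the toy scores are hypothetical and chosen only to illustrate the definition.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest scores."""
    # Indices of the k largest scores in each row (their internal order is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Three examples, three classes: one row of class scores per example.
scores = np.array([[0.1, 0.6, 0.3],
                   [0.5, 0.2, 0.3],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 2, 0])
err_top1 = top_k_error(scores, labels, k=1)  # two of the three examples are wrong
err_top2 = top_k_error(scores, labels, k=2)  # one of the three examples is wrong
print(err_top1, err_top2)
```

The top-$k$ error can never exceed the top-1 error, since enlarging the list of accepted predictions can only turn misses into hits.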

As an example of a typical convolutional network architecture, we look in detail 
at the VGG-16 model (Simonyan and Zisserman, 2014), where VGG stands for the 
Visual Geometry Group, who developed the model, and 16 refers to the number of 
learnable layers in the model. VGG-16 has some simple design principles leading to 
a relatively uniform architecture, shown in Figure 10.10, that minimizes the number 
of hyperparameter choices that need to be made. It takes an input image having 
224 x 224 pixels and three colour channels, followed by sets of convolutional layers 
interspersed with down-sampling. All convolutional layers have filters of size 3 x 3 
with a stride of 1, same padding, and a ReLU activation function, whereas the max- 
pooling operations all use stride 2 and filter size 2 x 2 thereby down-sampling the 
number of units by a factor of 4. The first learnable layer is a convolutional layer in 
which each unit takes input from a 3 x 3 x 3 ‘cube’ from the stack of input channels 
and so has 28 parameters including the bias. These parameters are shared across 
all units in the feature map for that channel. There are 64 such feature channels in 
the first layer, giving an output tensor of size 224 x 224 x 64. The second layer 
is also convolutional and again has 64 channels. This is followed by max-pooling 
giving feature maps of size 112 x 112. Layers 3 and 4 are again convolutional, of 
dimensionality 112 x 112, and each was chosen to have 128 channels. This increase 
in the number of channels offsets to some extent the down-sampling in the max- 
pooling layer to ensure that the number of variables in the representation at each 
layer does not decrease too rapidly through the network. Again, this is followed by a 
max-pooling operation to give a feature map size of 56 x 56. Next come three more 
convolutional layers each with 256 channels, thereby again doubling the number of 
channels in association with the down-sampling. This is followed by another max- 
pooling to give feature maps of size 28 x 28 followed by three more convolutional 
layers each having 512 channels, followed by another max-pooling, which down- 
samples to feature maps of size 14 x 14. This is followed by three more convolutional 
layers, although the number of feature maps in these layers is kept at 512, followed 
by another max-pooling, which brings the size of the feature maps down to 7 x 7. 
Finally there are three more layers that are fully connected, meaning that they are 
standard neural network layers with full connectivity and no sharing of parameters. 
The final max-pooling layer has 512 channels each of size 7 x 7 giving 25,088 units 
in total. The first fully connected layer has 4,096 units, each of which is connected 
to each of the max-pooling units. This is followed by a second fully connected layer 
again with 4,096 units, and finally there is a third fully connected layer with 1,000 
units so that the network can be applied to a classification problem involving 1,000 
classes. All the learnable layers in the network have nonlinear ReLU activation 
functions, except for the output layer, which has a softmax activation function. In 
total there are roughly 138 million independently learnable parameters in VGG-16, 
the majority of which (nearly 103 million) are in the first fully connected layer, 
whereas most of the connections are in the first convolutional layer (Exercise 10.8). 

Figure 10.10 The architecture of a typical convolutional network, in this case a model called VGG-16. 
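
The parameter counts quoted for VGG-16 can be reproduced from the layer description given above. The sketch below assumes one bias per output channel in each convolutional layer and one bias per unit in each fully connected layer; the variable names are illustrative.

```python
# (number of convolutional layers, output channels) for each block of VGG-16
conv_blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
fc_units = [4096, 4096, 1000]

def conv_params(m, c_in, c_out):
    """Weights and biases of a convolutional layer with m x m filters."""
    return (m * m * c_in + 1) * c_out

c_in, size, conv_total = 3, 224, 0
for n_layers, c_out in conv_blocks:
    for _ in range(n_layers):
        conv_total += conv_params(3, c_in, c_out)
        c_in = c_out
    size //= 2  # each block ends with 2 x 2 max-pooling of stride 2

n_in = size * size * c_in  # 7 * 7 * 512 = 25088 units after the final pooling
fc_counts = []
for n_out in fc_units:
    fc_counts.append((n_in + 1) * n_out)
    n_in = n_out

total = conv_total + sum(fc_counts)
print(conv_params(3, 3, 1))  # 28
print(fc_counts[0])          # 102764544
print(total)                 # 138357544
```

The thirteen convolutional layers together account for fewer than 15 million parameters, which illustrates how strongly weight sharing reduces the parameter count relative to full connectivity.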

Earlier CNNs typically had fewer convolutional layers, as they had larger 
receptive fields. For example, AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) has 
11 x 11 receptive fields with a stride of 4. We saw in Figure 10.9 that larger receptive 
fields can also be achieved implicitly by using multiple layers each having smaller 
receptive fields. The advantage of the latter approach is that it requires significantly 
fewer parameters, effectively imposing an inductive bias on the larger filters as they 
must be composed of convolutional sub-filters. Although this is a highly complex 
architecture, only the network function itself needs to be coded explicitly since the 
derivatives of the cost function can be evaluated using automatic differentiation 
(Section 8.2) and the cost function optimized using stochastic gradient descent. 


10.3. Visualizing Trained CNNs 


We turn now to an exploration of the features learned by modern deep CNNs, and 
we will see some remarkable similarities to the properties of the mammalian visual 
cortex. 


10.3.1 Visual cortex 


Historically, much of the motivation for CNNs came from pioneering research 
in neuroscience, which gave insights into the nature of visual processing in mam- 
mals including humans. Electrical signals from the retina are transformed through 
a series of processing layers in the visual cortex, which is at the back of the brain, 
where the neurons are organized into two-dimensional sheets each of which forms a 
map of the two-dimensional visual field. In their pioneering work, Hubel and Wiesel 
(1959) measured the electrical responses of individual neurons in the visual cortex 
of cats while presenting visual stimuli to the cats’ eyes. They discovered that some 
neurons, called ‘simple cells’, have a strong response to visual inputs with a simple 
edge oriented at a particular angle and located at a particular position within the vi- 
sual field, whereas other stimuli generated relatively little response in those neurons. 
More detailed studies showed that the response of these simple cells can be modelled 
using Gabor filters, which are two-dimensional functions defined by 


$$G(x, y) = A \exp\left(-\alpha \tilde{x}^2 - \beta \tilde{y}^2\right) \sin\left(\omega \tilde{x} + \phi\right) \tag{10.6}$$

where 

$$\tilde{x} = (x - x_0)\cos(\theta) + (y - y_0)\sin(\theta) \tag{10.7}$$

$$\tilde{y} = -(x - x_0)\sin(\theta) + (y - y_0)\cos(\theta). \tag{10.8}$$


Equations (10.7) and (10.8) represent a rotation of the coordinate system through 
an angle $\theta$ and therefore the $\sin(\cdot)$ term in (10.6) represents a sinusoidal spatial 
oscillation oriented in a direction defined by the polar angle $\theta$, with frequency $\omega$ 
and phase angle $\phi$. The exponential factor in (10.6) creates a decay envelope that 
localizes the filter in the neighbourhood of position $(x_0, y_0)$ and with decay rates 
governed by $\alpha$ and $\beta$. Example Gabor filters are shown in Figure 10.11. 
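
A direct transcription of (10.6) to (10.8) into NumPy is sketched below. The default parameter values and the 11 x 11 sampling grid are arbitrary choices for illustration and are not taken from the text.

```python
import numpy as np

def gabor(x, y, A=1.0, alpha=0.1, beta=0.1, omega=1.0, phi=0.0,
          theta=0.0, x0=0.0, y0=0.0):
    """Evaluate the Gabor filter (10.6) at coordinates (x, y)."""
    # Rotated coordinates, equations (10.7) and (10.8).
    x_rot = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    y_rot = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = np.exp(-alpha * x_rot**2 - beta * y_rot**2)
    return A * envelope * np.sin(omega * x_rot + phi)

# Sample the filter on a small grid centred on (x0, y0) = (0, 0).
coords = np.arange(-5, 6)
xx, yy = np.meshgrid(coords, coords)
kernel = gabor(xx, yy, theta=np.pi / 4)
print(kernel.shape)  # (11, 11)
```

With $\phi = 0$ the filter is antisymmetric about its centre and so responds to an edge, whereas $\phi = \pi/2$ gives a symmetric filter that responds to a bar.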

Hubel and Wiesel also discovered the presence of ‘complex cells’, which respond 
to more complex stimuli and which seem to be derived by combining and 
processing the output of simple cells. These responses exhibit some degree of invari- 
ance to small changes in the input such as shifts in location, analogous to the pooling 
units in a convolutional deep network. Deeper levels of the mammalian visual pro- 
cessing system have even more specific responses and even greater invariance to 
transformations of the visual input. Such cells have been termed ‘grandmother cells’ 
because such a cell could notionally respond if, and only if, the visual input corre- 
sponds to a person’s grandmother, irrespective of location, scale, lighting, or other 
transformations of the scene. This work directly inspired an early form of deep neural 
network called the neocognitron (Fukushima, 1980), which was the forerunner of 
convolutional neural networks. The neocognitron had multiple layers of processing 
comprising local receptive fields with shared weights followed by local averaging or 
max-pooling to confer positional invariance. However, it lacked an end-to-end training 
procedure since it predated the development of backpropagation, relying instead 
on greedy layer-wise learning through an unsupervised clustering algorithm. 

Figure 10.11 


10.3.2 Visualizing trained filters 


Suppose we have a trained deep CNN and we wish to explore what the hidden 
units have learned to detect. For the filters in the first convolutional layer this is 
relatively straightforward, as they correspond to small patches in the original input 
image space, and so we can visualize the network weights associated with these 
filters directly as small images. The first convolutional layer computes inner products 
between the filters and the corresponding image patches, and so the unit will have a 
large activation when the inner product has a large magnitude. 

Figure 10.12 shows some example filters from the first layer of a CNN trained 
on the ImageNet data set. We see a remarkable similarity between these filters and 
the Gabor filters of Figure 10.11. However, this does not imply that a convolutional 
neural network is a good model of how the brain works, because very similar results 
can be obtained from a wide variety of statistical methods (Hyvärinen, Hurri, and 
Hoyer, 2009). This is because these characteristic filters are a general property of 
the statistics of natural images and therefore prove useful for image understanding 
in both natural and artificial systems. 

Figure 10.12 Examples of learned filters from 

Although we can visualize the filters in the first layer directly, the subsequent 
layers in the network are harder to interpret because their inputs are not patches 
of images but groups of filter responses. One approach, analogous to that used by 
Hubel and Wiesel, is to present a large number of image patches to the network and 
see which produce the highest activation value in any particular hidden unit. 
Figure 10.13 shows examples obtained using a network with five convolutional layers, 
followed by two fully connected layers, trained on 1.3 million ImageNet data points 
spanning 1,000 classes. We see a natural hierarchical structure, with the first layer 
responding to edges, the second layer responding to textures and simple shapes, layer 3 
showing components of objects (such as wheels), and layer 5 showing entire objects. 

We can extend this technique to go beyond simply selecting image patches from 
the validation set and instead perform a numerical optimization over the input 
variables to maximize the activation of a particular unit (Zeiler and Fergus, 2013; 
Simonyan, Vedaldi, and Zisserman, 2013; Yosinski et al., 2015). If we choose the unit 
to be one of the outputs then we can look for an image that is most representative 
of the corresponding class label. Because the output units generally have a softmax 
activation function, it is better to maximize the pre-activation value that feeds into 
the softmax rather than the class probability directly, as this ensures the optimization 
depends on only one class. For example, if we seek the image that produces the 
strongest response to the class ‘dog’, then if we optimize the softmax output it could 
drive the image to be, say, less like a cat because of the denominator in the softmax. 
This approach is related to adversarial training. Unconstrained optimization of the 
output-unit activation, however, leads to individual pixel values being driven to infin- 
ity and also creates high-frequency structure that is difficult to interpret, and so some 
form of regularization is required to find solutions that are closer to natural images. 
Yosinski et al. (2015) used a regularization function comprising the sum of squares 
of the pixel values along with a procedure that alternates gradient-based updates to 
the image pixel values with a blurring operation to remove high-frequency structure 
and a clipping operation that sets to zero those pixel values that make only small 
contributions to the class label. Example results are shown in Figure 10.14. 

Figure 10.13 Examples of image patches (taken from a validation set) that produce the strongest activation in 
the hidden units in a network having five convolutional layers trained on ImageNet data. The top nine activations 
in each feature map are arranged as a 3 x 3 grid for four randomly chosen channels in each of the corresponding 
layers. We see a steady progression in complexity with depth, from simple edges in layer 1 to complete objects 
in layer 5. [From Zeiler and Fergus (2013) with permission.] 
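
The reason for maximizing the pre-activation rather than the softmax output, discussed above, can be seen from the gradients with respect to the output-layer pre-activations. For a softmax output $p_c$ we have $\partial p_c / \partial a_j = p_c(\delta_{cj} - p_j)$, which is negative for every $j \neq c$. The sketch below evaluates this for a hypothetical three-class problem; the class names and the numerical values are illustrative only.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract the maximum for numerical stability
    return e / e.sum()

def grad_prob_wrt_logits(a, c):
    """Gradient of the softmax probability p_c with respect to the pre-activations a."""
    p = softmax(a)
    onehot = np.zeros_like(p)
    onehot[c] = 1.0
    return p[c] * (onehot - p)

logits = np.array([2.0, 1.0, 0.5])  # hypothetical pre-activations for dog, cat, car
dog = 0
g_prob = grad_prob_wrt_logits(logits, dog)
g_logit = np.eye(3)[dog]            # gradient of a_dog itself
print(g_prob)   # positive for 'dog', negative for 'cat' and 'car'
print(g_logit)  # [1. 0. 0.]
```

Ascending the gradient of the probability therefore also pushes down the pre-activations of the competing classes, whereas ascending the gradient of the pre-activation involves only the chosen class.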


10.3.3 Saliency maps 


Another way to gain insight into the features used by a convolutional network 
is through saliency maps, which aim to identify those regions of an image that are 
most significant in determining the class label. This is best done by investigating the 
final convolutional layer because this still retains spatial localization, which becomes 
lost in the subsequent fully connected layers, and yet it has the highest level of 
semantic representation. The Grad-CAM (gradient class activation mapping) method 
(Selvaraju et al., 2016) first computes, for a given input image, the derivatives of the 
output-unit pre-activation $a^{(c)}$ for a given class $c$, before the softmax, with respect to 
the pre-activations $a_{ij}^{(k)}$ of all the units in the final convolutional layer for channel $k$. 
For each channel in that layer, the average of those derivatives is evaluated to give 

$$\alpha_k = \frac{1}{M_k} \sum_{i} \sum_{j} \frac{\partial a^{(c)}}{\partial a_{ij}^{(k)}} \tag{10.9}$$

where $i$ and $j$ index the rows and columns of channel $k$, and $M_k$ is the total number 
of units in that channel. These averages are then used to form a weighted sum of the 
form 

$$\mathbf{L} = \sum_{k} \alpha_k \mathbf{A}^{(k)} \tag{10.10}$$

in which $\mathbf{A}^{(k)}$ is a matrix with elements $a_{ij}^{(k)}$. The resulting array has the same 
dimensionality as the final convolutional layer, for example 14 x 14 for the VGG 
network shown in Figure 10.10, and can be superimposed on the original image in 
the form of a ‘heat map’ as seen in Figure 10.15. 

Figure 10.14 Examples of synthetic images generated by maximizing the class probability with respect to the 
image pixel channel values for a trained convolutional classifier. Four different solutions, obtained with different 
settings of the regularization parameters, are shown for each of four object classes. [From Yosinski et al. (2015) 
with permission.] 
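
Given the pre-activations of the final convolutional layer and the derivatives of the class pre-activation with respect to them, the Grad-CAM quantities (10.9) and (10.10) reduce to a mean and a weighted sum. The sketch below follows those two equations as written here. It assumes that both arrays have already been obtained from a trained network by automatic differentiation, and random numbers stand in for them.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Saliency map from (H, W, K) pre-activations and their (H, W, K) derivatives."""
    # (10.9): average the derivatives over the units of each channel k.
    alpha = gradients.mean(axis=(0, 1))   # shape (K,)
    # (10.10): weighted sum over channels of the activation maps A^(k).
    return activations @ alpha            # shape (H, W)

rng = np.random.default_rng(0)
A = rng.standard_normal((14, 14, 512))    # stand-in for the final convolutional layer
G = rng.standard_normal((14, 14, 512))    # stand-in for the derivatives of a^(c)
heat = grad_cam_map(A, G)
print(heat.shape)  # (14, 14)
```

To superimpose the result on a 224 x 224 input image, the 14 x 14 map would additionally need to be up-sampled, which is not shown here.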


10.3.4 Adversarial attacks 


Gradients with respect to changes in the input image pixel values can also be 
used to create adversarial attacks against convolutional networks (Szegedy et al., 
2013). These attacks involve making very small modifications to an image, at a level 
that is imperceptible to a human, which cause the image to be misclassified by the 
neural network. One simple approach to creating adversarial images is called the 
fast gradient sign method (Goodfellow, Shlens, and Szegedy, 2014). This involves 
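changing each pixel value in an image $\mathbf{x}$ by a fixed amount $\epsilon$ with a sign determined 
by the gradient of an error function $E(\mathbf{x}, t)$ with respect to the pixel values. This 
gives a modified image defined by 


$$\mathbf{x}' = \mathbf{x} + \epsilon \, \mathrm{sign}\left(\nabla_{\mathbf{x}} E(\mathbf{x}, t)\right). \tag{10.11}$$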


Here $t$ is the true label of $\mathbf{x}$, and the error $E(\mathbf{x}, t)$ could, for example, be the 
negative log likelihood of $\mathbf{x}$. The required gradient can be computed efficiently using 
backpropagation (Chapter 8). During conventional training of a neural network, the network 
parameters are adjusted to minimize this error, whereas the modification defined by 
(10.11) alters the image (while keeping the trained network parameters fixed) so as 
to increase the error. By keeping $\epsilon$ small, we ensure that the changes to the image are 
undetectable to the human eye. Remarkably, this can give images that are misclassified 
by the network with high confidence, as seen in the example in Figure 10.16. 

Figure 10.15 Saliency maps for the VGG-16 network with respect to the ‘dog’ and ‘cat’ 
categories. Panels: original image, saliency map for ‘dog’, saliency map for ‘cat’. 
[From Selvaraju et al. (2016) with permission.] 

Figure 10.16 Example of an adversarial attack against a trained convolutional network. 
The image on the left is classified as a panda with confidence 57.7%. The addition of a 
small level of a random-looking perturbation (that itself is classified as a nematode with 
confidence 8.2%) results in the image on the right, which is classified as a gibbon with 
confidence 99.3%. [From Goodfellow, Shlens, and Szegedy (2014) with permission.] 
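
The update (10.11) can be illustrated without a deep network by using a linear softmax classifier, for which the gradient of the negative log likelihood with respect to the input is available in closed form as $\mathbf{W}^{\mathrm{T}}(\mathbf{p} - \mathbf{t})$, where $\mathbf{p}$ is the vector of class probabilities and $\mathbf{t}$ is the one-hot target. The classifier weights, the input, and the value of $\epsilon$ below are random or arbitrary stand-ins, so this is a sketch of the mechanics rather than a realistic attack.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(W, b, x, t):
    """Negative log likelihood of class t for a linear softmax classifier."""
    return -np.log(softmax(W @ x + b)[t])

def nll_grad_x(W, b, x, t):
    """Gradient of the negative log likelihood with respect to the input x."""
    p = softmax(W @ x + b)
    p[t] -= 1.0              # p minus the one-hot target
    return W.T @ p

def fgsm(W, b, x, t, eps):
    """Fast gradient sign method, equation (10.11)."""
    return x + eps * np.sign(nll_grad_x(W, b, x, t))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 20))   # hypothetical 'trained' weights
b = np.zeros(3)
x = rng.standard_normal(20)        # hypothetical input
t = int(np.argmax(W @ x + b))      # treat the current prediction as the true label
x_adv = fgsm(W, b, x, t, eps=0.05)
print(nll(W, b, x, t), nll(W, b, x_adv, t))  # the error increases
```

No component of the input changes by more than $\epsilon$, yet every component moves in the direction that increases the error, so the total effect grows with the dimensionality of the input.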

The ability to fool neural networks in this way raises potential security concerns 
as it creates opportunities for attacking trained classifiers. It might appear that this 
issue arises from over-fitting, in which a high-capacity model has adapted precisely 
to the specific image such that small changes in the input produce large changes 
in the predicted class probabilities. However, it turns out that an image that has 
been adapted to give a spurious output for a particular trained network can give 
similarly spurious outputs when fed to other networks (Goodfellow, Shlens, and 
Szegedy, 2014). Moreover, a similar adversarial result can be obtained with much 
less flexible linear models. It is even possible to create physical artefacts such that 
a regular, uncorrupted image of the artefact will give erroneous predictions when 
presented to a trained neural network, as seen in Figure 10.17. Although these basic 
kinds of adversarial attacks can be addressed by simple modifications to the network 
training process, more sophisticated approaches are harder to defeat. Understanding 
the implications of these results and mitigating their potential pitfalls remain open 
areas of research. 


Figure 10.17 Two examples of physical stop signs that have been modified. Images of these 
objects are robustly classified as 45 mph speed-limit signs by CNNs. [From Eykholt et al. (2018) 
with permission.] 


10.3.5 Synthetic images 


As a final example of image modification that provides additional insights into 
the operation of a trained convolutional network, we consider a technique called 
DeepDream (Mordvintsev, Olah, and Tyka, 2015). The goal is to generate a synthetic 
image with exaggerated characteristics. We do this by determining which nodes in 
a particular hidden layer of the network respond strongly to a particular image and 
then modifying the image to amplify those responses. For example, if we present 
an image of some clouds to a network trained on object recognition and a particular 
node detects a cat-like pattern at a particular region of the image, then we modify 
the image to be more ‘cat like’ in that region. To do this, we apply an image to the 
input of the network and forward propagate through to some particular layer. We 
then set the backpropagation $\delta$ variables for that layer equal to the pre-activations of 
the nodes and run backpropagation to the input layer to get a gradient vector over 
the pixels of the image. Finally, we modify the image by taking a small step in the 
direction of the gradient vector. This procedure can be viewed as a gradient-based 
method for increasing the function 


$$F(\mathbf{I}) = \sum_{i,j,k} a_{ijk}(\mathbf{I})^2 \tag{10.12}$$


where $a_{ijk}(\mathbf{I})$ is the pre-activation of the unit in row $i$ and column $j$ of channel $k$ in 
the chosen layer when the input image is I, and the sum is over all units and over 
all channels in that layer. To generate smooth-looking images, some regularization 
is applied in the form of spatial smoothing and pixel clipping. This process can 
then be repeated multiple times if stronger enhancements are desired. Examples 
of the resulting image are shown in Figure 10.18. It is interesting that even though 
convolutional networks are trained to discriminate between object classes, they seem 
able to capture at least some of the information needed to generate images from those 
classes. 

This technique can be applied to a photograph, or we can start with inputs con- 
sisting of random noise to obtain an image generated entirely from the trained net- 
work. Although DeepDream provides some insights into the operation of the trained 
network, it has primarily been used to generate interesting looking images as a form 
of artwork. 
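
The gradient step at the heart of this procedure can be sketched with a single linear layer standing in for the trained network, so that the pre-activations are $\mathbf{a} = \mathbf{W}\mathbf{I}$ and the gradient of the sum of squared pre-activations is $2\mathbf{W}^{\mathrm{T}}\mathbf{a}$. The weights, the image, and the step-size normalization are assumptions for illustration, and the smoothing and clipping regularization mentioned above is omitted.

```python
import numpy as np

def objective(W, image):
    """Sum of squared pre-activations of the chosen layer."""
    return np.sum((W @ image) ** 2)

def dream_step(W, image, step=0.01):
    """One normalized gradient-ascent step on the objective."""
    a = W @ image          # forward propagate to the chosen layer
    grad = 2.0 * W.T @ a   # backpropagate with delta proportional to a
    # Normalizing by the mean gradient magnitude keeps the step size stable.
    return image + step * grad / (np.abs(grad).mean() + 1e-8)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))  # hypothetical stand-in for a trained layer
image = rng.standard_normal(64)    # flattened 8 x 8 'image'
before = objective(W, image)
for _ in range(5):
    image = dream_step(W, image)
after = objective(W, image)
print(before < after)  # True
```

Units that already respond strongly contribute most to the gradient, which is why the procedure amplifies whatever patterns the layer detects in the starting image.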


10.4. Object Detection 


We have motivated the design of CNNs primarily by the image classification prob- 
lem, in which an entire image is assigned to a single class, for example ‘cat’ or 
‘bicycle’. This is reasonable for data sets such as ImageNet where, by design, each 
image is dominated by a single object. However, there are many other applications 
for CNNs that are able to exploit the inbuilt inductive biases. More generally, the 
convolutional layers of a CNN trained on a large image database for a particular 
task can learn internal representations that have broad applicability, and therefore a 
CNN can be fine-tuned for a wide range of specific tasks (Section 6.3.4). We have already 
seen an example of a convolutional network trained on ImageNet data, which through 
transfer learning was able to achieve human-level performance on skin lesion 
classification (Section 1.1.1). 

Figure 10.18 Examples of DeepDream applied to an image. The top row shows outputs when the algorithm 
is applied using the activations from the 7th convolutional layer of the VGG-16 network after five iterations and 
after 30 iterations. Similarly, the bottom row shows examples using the 10th layer, again after five iterations and 
after 30 iterations. 


10.4.1 Bounding boxes 


Many images have multiple objects belonging to one or more classes, and we 
may wish to detect the presence and class of each object. Moreover, in many appli- 
cations of computer vision we also need to determine the locations within the image 
of any objects that are detected. For example, an autonomous vehicle that uses RGB 
cameras may need to detect the presence and location of pedestrians and also identify 
road signs, other vehicles, etc. 

Consider the problem of specifying the location of an object in an image. A 
widely used approach is to define a bounding box, which consists of a rectangle that 
fits closely to the boundary of the object, as illustrated in Figure 10.19. The bounding 
box can be defined by the coordinates of its centre along with its width and height in 
the form of a vector $\mathbf{b} = (b_x, b_y, b_W, b_H)$. Here the elements of $\mathbf{b}$ can be specified 
in terms of pixels or as continuous numbers where, by convention, the top left of the 
image is given coordinates (0,0) and the bottom right is given coordinates (1, 1). 

Figure 10.19 An image containing several objects from different classes in which the location of each 
object is labelled by a close-fitting rectangle known as a bounding box. Here blue boxes 
correspond to the class ‘car’, red boxes to the class ‘pedestrian’, and orange boxes to the 
class ‘traffic light’. [Original image courtesy of Wayve Technologies Ltd.] 

When images are assumed to contain one, and only one, object drawn from 
a predefined set of $C$ classes, a CNN will generally have $C$ output units whose 
activation functions are defined by the softmax function (Section 5.3). An object can be 
localized by using an additional four outputs, with linear activation functions trained to predict 
the bounding box coordinates $(b_x, b_y, b_W, b_H)$. Since these quantities are continuous, 
a sum-of-squares error function over the corresponding outputs may be appropriate. 
This is used for example by Redmon et al. (2015), who first divide the image into a 
7 x 7 grid. For each grid cell, they use a convolutional network to output the class 
and bounding box coordinates of any object associated with that grid cell, based on 
features taken from the whole image. 


10.4.2 Intersection-over-union 


We need a meaningful way to measure the performance of a trained network that 
can predict bounding boxes. In image classification the output of the network is a 
probability distribution over class labels, and we can measure performance by look- 
ing at the log likelihood for the true class labels on a test set. For object localization, 
however, we need some way to measure the accuracy of a bounding box relative to 
some ground truth, where the latter could, for example, be obtained by human la- 
belling. The extent to which the predicted and target boxes overlap can be used as 
the basis for such a measure, but the area of the overlap will depend on the size of 
the object within the image. Also, a predicted bounding box should be penalized
for the region of the prediction that lies outside the ground truth bounding box. A
better metric that addresses both of these issues is called intersection-over-union, or
IoU, and is simply the ratio of the area of the intersection of the two bounding boxes
to that of their union, as illustrated in Figure 10.20. Note that the IoU measure lies
in the range 0 to 1. Predictions can be labelled as correct if the IoU measure exceeds
a threshold, which is typically set at 0.5. Note that IoU is not generally used directly
as a loss function for training as it is hard to optimize by gradient descent, and so
training is typically performed using centred objects, and the IoU score is mainly
used as an evaluation metric.

Figure 10.20 Illustration of the intersection-over-union metric for quantifying the
accuracy of a bounding box prediction. If the predicted bounding box is shown by the
blue rectangle and the ground truth by the red rectangle, then the intersection-over-union
is defined as the ratio of the area of the intersection of the boxes, shown in green on
the left, to the area of their union, shown in green on the right.
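
As an illustration of the IoU definition, the following minimal sketch computes the measure for two axis-aligned boxes. The corner format (x_min, y_min, x_max, y_max) and the function name `iou` are assumptions made for this example and are not taken from the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max). This corner format is an
    assumption of the example, not the centre/width/height format.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    # Union = sum of the two areas minus the doubly counted intersection.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
truth = (1.0, 1.0, 3.0, 3.0)
score = iou(predicted, truth)   # intersection area 1, union area 7
is_correct = score > 0.5        # threshold of 0.5, as mentioned in the text
```

Here the two boxes have the same size but are offset diagonally, so the IoU is 1/7 and the prediction would not be labelled as correct.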


10.4.3 Sliding windows 


One approach to object detection and object localization starts by creating a 
training set consisting of tightly cropped examples of the object to be detected, as 
well as examples of similarly cropped sections of images that do not contain any 
object (the ‘background’ class). This data set is used to train a classifier, such as 
a deep CNN, whose outputs represent the probability of there being an object of 
each particular class in the input window. The trained model is then used to detect 
objects in a new image by ‘scanning’ an input window across the image and, for 
each location, taking the resulting subset of the image as input to the classifier. This 
is called a sliding window. When an object is detected with high probability, the 
associated window location then defines the corresponding bounding box. 

One obvious drawback of this approach is that it can be computationally very 
costly due to the large number of potential window positions in the image. Further- 
more, the process may have to be repeated using windows of various scales to allow 
for different sizes of object within the image. A cost saving can be made by moving 
the input window in strides across the image, both horizontally and vertically, which 
are larger than one pixel. However, there is a trade-off between precision of location 
using a small stride and reducing the computational cost by using a larger stride. The 
computational cost of a sliding window approach may be reasonable for simple clas- 
sifiers, but for deep neural networks potentially containing millions of parameters, 
the cost of a naive implementation can be prohibitive. 
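
To give a feel for the trade-off between stride and cost, the sketch below counts the number of window positions, and hence the number of naive forward passes, for a single window scale. The image size, window size, and strides are hypothetical values chosen only for illustration.

```python
def num_window_positions(image_size, window_size, stride):
    """Window positions along one axis, keeping the window inside the image."""
    if window_size > image_size:
        return 0
    return (image_size - window_size) // stride + 1


def total_positions(height, width, window, stride):
    """Positions of a square window scanned horizontally and vertically."""
    return (num_window_positions(height, window, stride)
            * num_window_positions(width, window, stride))


# Hypothetical 480 x 640 image and 64 x 64 window.
for stride in (1, 4, 16):
    print(stride, total_positions(480, 640, 64, stride))
```

The count falls roughly as the square of the stride, at the price of a coarser grid of candidate locations.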

Fortunately, the convolutional structure of the neural network allows for a dramatic
improvement in efficiency (Sermanet et al., 2013). We note that a convolutional layer
within such a network itself involves sliding a feature detector, with shared weights,
across the input image in strides. Consequently, when a sliding window is used to
generate multiple forward passes through a convolutional network there is substantial
redundancy in the computation, as illustrated in Figure 10.21.

Figure 10.21 Illustration of replicated calculations when a CNN is used to process data
from a sliding input window, in which the red and blue boxes show two overlapping
locations for the input window. The green box represents one of the locations for the
receptive field of a hidden unit in the first convolutional layer, and the evaluation of
the corresponding hidden-unit activation is shared across the two window positions.

Because the computational structure of sliding windows mirrors that of convolu- 
tions, it turns out to be remarkably simple to implement sliding windows efficiently 
in a convolutional network. Consider the simplified convolutional network in Fig- 
ure 10.22, which consists of a convolutional layer followed by a max-pooling layer 
followed by a fully connected layer. For simplicity we have shown only a single 
channel in each layer, but the extension to multiple channels is straightforward. The 
input image to the network has size 6 x 6, the filters in the convolutional layer have 
size 3 x 3 with stride 1, and the max-pooling layer has non-overlapping receptive 
fields of size 2 x 2 with stride 2. This is followed by a fully connected layer with 
a single output unit. Note that we can also view this final layer as another convolu- 
tional layer with a filter size that is 2 x 2, so that there is only a single position for 
the filter and hence a single output. 

Now suppose this network is trained on centred images of objects and then applied
to a larger image of size 8 x 8, as shown in Figure 10.23, in which we simply enlarge
the network by increasing the size of the convolutional and max-pooling layers. The
convolution layer now has size 6 x 6 and the pooling layer has size 3 x 3. There are
now four output units, each of which has its own softmax function. The weights into
each of these units are shared across the four units. We see that the calculations
needed to process the input corresponding to a window position in the top left corner
of the input image are the same as those used to process the original 6 x 6 inputs used
in training. For the remaining window positions, only a small amount of additional
computation is needed, as indicated by the blue squares, leading to a significant
increase in efficiency compared to a naive repeated application of the full
convolutional network. Note that the fully connected layers themselves now have a
convolutional structure.

Figure 10.22 Example of a simple convolutional network having a single channel at each
layer used to illustrate the concept of a sliding window for detecting objects in images.

Figure 10.23 Application of the network shown in Figure 10.22 to a larger image in which
the additional computation required corresponds to the blue regions.
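
The equivalence between one enlarged forward pass and repeated application of the small network can be checked numerically. The sketch below is only an illustration: the weights are random stand-ins for trained values, the ReLU non-linearity is an assumption, the final softmax is omitted, and the helper names are invented for the example.

```python
import numpy as np


def conv2d(x, w, stride=1):
    """'Valid' cross-correlation of a 2-D array x with a 2-D filter w."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * w)
    return out


def max_pool(x, size=2):
    """Non-overlapping max-pooling (stride equal to the pool size)."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    return x[:oh * size, :ow * size].reshape(oh, size, ow, size).max(axis=(1, 3))


def network(x, w_conv, w_fc):
    """3x3 conv (stride 1), ReLU (assumed), 2x2 max-pool, final layer as 2x2 conv."""
    h = np.maximum(conv2d(x, w_conv), 0.0)
    return conv2d(max_pool(h), w_fc)


rng = np.random.default_rng(0)
w_conv = rng.normal(size=(3, 3))   # random stand-ins for trained weights
w_fc = rng.normal(size=(2, 2))
image = rng.normal(size=(8, 8))

# A single pass over the whole 8x8 image gives a 2x2 map of outputs.
shared = network(image, w_conv, w_fc)

# Naive alternative: apply the 6x6 network separately to each window.
# Because of the 2x2 pooling, the window positions are 2 pixels apart.
naive = np.array([[network(image[r:r + 6, c:c + 6], w_conv, w_fc)[0, 0]
                   for c in (0, 2)] for r in (0, 2)])

print(np.allclose(shared, naive))
```

The four outputs of the enlarged network agree with four separate applications of the original network, while the convolution and pooling for the overlapping regions are evaluated only once.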


10.4.4 Detection across scales 


As well as looking for objects in different positions in the image, we also need 
to look for objects at different scales and at different aspect ratios. For example, a 
tight bounding box drawn around a cat will have a different aspect ratio when the 
cat is sitting upright compared to when it is lying down. Instead of using multiple 
detectors with different sizes and shapes of input window, it is simpler but equivalent 
to use a fixed input window and to make multiple copies of the input image each with 
a different pair of horizontal and vertical scaling factors. The input window is then 
scanned over each of the image copies to detect objects, and the associated scaling 
factors are then used to transform the bounding box coordinates back into the original 
image space, as illustrated in Figure 10.24. 

Figure 10.24 Illustration of the detection and localization of objects at multiple scales and aspect ratios using 
a fixed input window. The original image (a) is replicated multiple times and each copy is scaled in the horizontal 
and/or vertical directions, as illustrated for a horizontal scaling in (b). A fixed-sized window is then scanned 
over the scaled images. When an object is detected with high probability, as illustrated by the red box in (b), 
the corresponding window coordinates can be projected back into the original image space to determine the 
corresponding bounding box as shown in (c). 
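
The projection of a detection back into the original image space amounts to dividing the window coordinates by the scaling factors. A minimal sketch follows, assuming boxes in (x_min, y_min, x_max, y_max) format and a hypothetical helper name `to_original_coords`.

```python
def to_original_coords(window_box, scale_x, scale_y):
    """Map a box found in a rescaled image copy back to the original image.

    window_box is (x_min, y_min, x_max, y_max) in the scaled copy, where the
    copy was made by multiplying horizontal coordinates by scale_x and
    vertical coordinates by scale_y.
    """
    x0, y0, x1, y1 = window_box
    return (x0 / scale_x, y0 / scale_y, x1 / scale_x, y1 / scale_y)


# A fixed 64 x 64 window detected in a copy whose width was halved.
box = to_original_coords((32, 16, 96, 80), scale_x=0.5, scale_y=1.0)
# box == (64.0, 16.0, 192.0, 80.0), a 128 x 64 box in the original image
```

A square window in the horizontally compressed copy therefore corresponds to a wide bounding box in the original image.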


10.4.5 Non-max suppression 


By scanning a trained convolutional network over an image, it is possible to 
detect multiple instances of the same class of object within the image as well as in- 
stances of objects from other classes. However, this also tends to produce multiple 
detections of the same object at similar locations, as illustrated in Figure 10.25. This 
can be addressed using non-max suppression, which, for each object class in turn, 
works as follows. It first runs the sliding window over the whole image and evaluates 
the probability of an object of that class being present at each location. Next it elim- 
inates all the associated bounding boxes whose probability is below some threshold, 
say 0.7, giving a result of the kind illustrated in Figure 10.25. The box with the 
highest probability is considered to be a successful detection, and the corresponding 
bounding box is recorded as a prediction. Next, any other boxes whose IoU with 
the successful detection box exceeds some threshold, say 0.5, are discarded. This is 
intended to eliminate multiple nearby detections of the same object. Then of the 
remaining boxes, the one with the highest probability is declared to be another suc- 
cessful detection, and the elimination step is repeated. The process continues until 
all bounding boxes have either been discarded or declared as successful detections. 
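
The procedure above can be sketched for a single object class as follows. The box format (x_min, y_min, x_max, y_max), the function names, and the example boxes are assumptions made for illustration; the default thresholds are the values quoted in the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, probs, prob_threshold=0.7, iou_threshold=0.5):
    """Return the indices of the boxes declared as successful detections."""
    # Eliminate boxes whose probability is below the threshold.
    candidates = [i for i, p in enumerate(probs) if p >= prob_threshold]
    # Consider the remaining boxes in order of decreasing probability.
    candidates.sort(key=lambda i: probs[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)          # highest remaining probability
        kept.append(best)
        # Discard boxes that overlap the successful detection too strongly.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 9)]
probs = [0.9, 0.8, 0.75, 0.6]
kept = non_max_suppression(boxes, probs)   # [0, 2]
```

In this example the second box is suppressed because its IoU with the first is about 0.68, the third box survives as a separate instance, and the fourth is removed by the probability threshold.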


10.4.6 Fast region CNNs 


Another way to speed up object detection and localization is to note that a scan- 
ning window approach applies the full power of a deep convolutional network to 
all areas of the image, even though some areas may be unlikely to contain an ob- 
ject. Instead, we can apply some form of computationally cheaper technique, for 
example a segmentation algorithm, to identify parts of the image where there is a 
higher probability of finding an object, and then apply the full network only to these 
areas, leading to techniques such as fast region proposals with CNN or fast R-CNN
(Girshick, 2015). It is also possible to use a region proposal convolutional network
to identify the most promising regions, leading to faster R-CNN (Ren et al., 2015),
which allows end-to-end training of both the region proposal network and the
detection and localization network.

Figure 10.25 Schematic illustration of multiple detections of the same object at nearby
locations, along with their associated probabilities. The red bounding box corresponds
to the highest overall probability. Non-max suppression eliminates the other overlapping
candidate bounding boxes shown in blue, while preserving the detection of another
instance of the same object class shown by the bounding box in green.

A disadvantage of the sliding window approach is that if we want a very precise
localization of the objects then we need to consider large numbers of finely spaced
window positions, which becomes computationally costly. A more efficient approach is
to combine sliding windows with the direct bounding box predictions that we dis- 
cussed at the start of this section (Sermanet et al., 2013). In this case, the continuous 
outputs predict the position of the bounding box relative to the window position and 
therefore provide some fine-tuning to the predicted position. 


10.5 Image Segmentation


In an image classification problem, an entire image is assigned to a single class la- 
bel. We have seen that more detailed information is provided if multiple objects are 
detected and their positions recorded using bounding boxes. An even more detailed 
analysis is obtained with semantic segmentation in which every pixel of an image is 
assigned to one of a predefined set of classes. This means that the output space will 
have the same dimensionality as the input image and can therefore be conveniently 
represented as an image with the same number of pixels. Although the input image 
will generally have three channels for R, G, and B, the output array will have C 
channels, if there are C classes, representing the probability for each class. If we as- 
sociate a different (arbitrarily chosen) colour with each class, then the prediction of a 
segmentation network can be represented as an image in which each pixel is coloured 
according to the class having the highest probability, as illustrated in Figure 10.26. 
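
As a rough illustration of this output representation, the sketch below converts a C-channel array of per-pixel class scores into a coloured segmentation image. The random scores stand in for the outputs of a real network, and the array sizes and colour palette are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 4, 5
logits = rng.normal(size=(C, H, W))   # stand-in for network outputs

# Softmax over the channel axis gives a class distribution at each pixel.
exp = np.exp(logits - logits.max(axis=0, keepdims=True))
probs = exp / exp.sum(axis=0, keepdims=True)

# Most probable class at each pixel, shape (H, W).
labels = probs.argmax(axis=0)

# Arbitrarily chosen RGB colour for each class.
palette = np.array([[0, 0, 255], [255, 0, 0], [255, 165, 0]], dtype=np.uint8)
segmentation = palette[labels]        # shape (H, W, 3)
```

Indexing the palette with the label array assigns each pixel the colour of its most probable class.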


10.5.1 Convolutional segmentation 


A simple way to approach a semantic segmentation problem would be to construct
a convolutional network that takes as input a rectangular section of the image
centred on a pixel and that has a single softmax output that classifies that pixel. By
applying such a network to each pixel in turn, the entire image can be segmented
(this would require edge padding around the image depending on the size of the
input window). However, this approach would be extremely inefficient due to redundant
calculations caused by overlapping patches. As we have seen (Figures 10.21 and 10.23),
we can remove this inefficiency by grouping together the forward-pass calculations for
different input locations into a single network, which results in a model in which the
final fully connected layers are also convolutional. We could therefore create a CNN in
which each layer has the same dimensionality as the input image, by having stride 1 at
each layer with same padding and no pooling. Each output unit has a softmax activation
function with weights that are shared across all outputs. Although this could work,
such a network would still need many layers, with multiple channels in each layer,
to learn the complex internal representations needed to achieve high accuracy, and
overall this would be prohibitively costly for images of reasonable resolution.

Figure 10.26 Example of an image and its corresponding semantic segmentation in which
each pixel is coloured according to its class. For example, blue pixels correspond to the
class ‘car’, red pixels to the class ‘pedestrian’, and orange pixels to the class ‘traffic
light’. [Courtesy of Wayve Technologies Ltd.]


10.5.2 Up-sampling 


As we have already seen, most convolutional networks use several levels of 
down-sampling so that as the number of channels increases, the size of the feature 
maps decreases, keeping the overall size and cost of the network tractable, while 
allowing the network to extract semantically meaningful high-order features from the 
image. We can use this concept to create a more efficient architecture for semantic 
segmentation by taking a standard deep convolutional network and adding additional 
learnable layers that take the low-dimensional internal representation and transform 
it back up to the original image resolution (Long, Shelhamer, and Darrell, 2014; Noh, 
Hong, and Han, 2015; Badrinarayanan, Kendall, and Cipolla, 2015), as illustrated in 
Figure 10.27. 

To do this we need a way to reverse the down-sampling effects of strided convolutions
and pooling operations. Consider first the up-sampling analogue of pooling,
where the output layer has a larger number of units than the input layer, for example
with each input unit corresponding to a 2 x 2 block of output units. The question is
then what values to use for the outputs. To find an up-sampling analogue of average
pooling, we can simply copy over each input value into all the corresponding output
units, as shown in Figure 10.28(a). We see that applying average pooling to the
output of this operation regenerates the input.

Figure 10.27 Illustration of a convolutional neural network used for semantic image
segmentation, showing the reduction in the dimensionality of the feature maps through a
series of strided convolutions and/or pooling operations, followed by a series of
transpose convolutions and/or unpooling which increase the dimensionality back up to
that of the original image.
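
The up-sampling analogue of average pooling can be sketched in a few lines. The function names here are chosen for the example, and 2 x 2 blocks are assumed.

```python
import numpy as np


def average_unpool(x, size=2):
    """Copy each input value into a size x size block of the output."""
    return np.repeat(np.repeat(x, size, axis=0), size, axis=1)


def average_pool(x, size=2):
    """Non-overlapping average pooling over size x size blocks."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x.reshape(h, size, w, size).mean(axis=(1, 3))


x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
up = average_unpool(x)        # 4 x 4 array of constant 2 x 2 blocks
recovered = average_pool(up)  # equal to x
```

Since every block of the up-sampled array is constant, averaging over each block returns the original input values.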

For max-pooling, we can consider the operation shown in Figure 10.28(b) in
which each input value is copied into the first unit of the corresponding output block,
and the remaining values in each block are set to zero. Again we see that applying
a max-pooling operation to the output layer regenerates the input layer. This
is sometimes called max-unpooling. Assigning the non-zero value to the first element
of the output block seems arbitrary, and so a modified approach can be used
that also preserves more of the spatial information from the down-sampling layers
(Badrinarayanan, Kendall, and Cipolla, 2015). This is done by choosing a network
architecture in which each max-pooling down-sampling layer has a corresponding
up-sampling layer later in the network. Then during down-sampling, a record is kept
of which element in each block had the maximum value, and then in the corresponding
up-sampling layer, the non-zero element is chosen to have the same location, as
illustrated for 2 x 2 max-pooling in Figure 10.29.

Figure 10.28 Illustration of unpooling operations showing (a) an analogue of average
pooling and (b) an analogue of max-pooling.

Figure 10.29 Some of the spatial information from a max-pooling layer, shown on the
left, can be preserved by noting the location of the maximum value for each 2 x 2 block
in the input array, and then in the corresponding up-sampling layer, placing the
non-zero entry at the corresponding location in the output array.
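
The index-preserving form of max-unpooling can be sketched as below. This is only an illustration for a single channel with non-overlapping blocks. The function names and the example array are assumptions, and in a real network the values passed to the up-sampling layer would come from intermediate layers rather than directly from the pooled output.

```python
import numpy as np


def max_pool_with_indices(x, size=2):
    """Non-overlapping max-pooling that records where each maximum was found."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # Rearrange so that the last axis lists the elements of each block.
    blocks = (x.reshape(h, size, w, size)
               .transpose(0, 2, 1, 3)
               .reshape(h, w, size * size))
    idx = blocks.argmax(axis=2)
    pooled = np.take_along_axis(blocks, idx[..., None], axis=2)[..., 0]
    return pooled, idx


def max_unpool(values, idx, size=2):
    """Place each value at the recorded location of its block; zeros elsewhere."""
    h, w = values.shape
    blocks = np.zeros((h, w, size * size), dtype=values.dtype)
    np.put_along_axis(blocks, idx[..., None], values[..., None], axis=2)
    return (blocks.reshape(h, w, size, size)
                  .transpose(0, 2, 1, 3)
                  .reshape(h * size, w * size))


x = np.array([[1, 5, 2, 0],
              [3, 4, 7, 6],
              [9, 8, 1, 2],
              [0, 1, 3, 4]])
pooled, idx = max_pool_with_indices(x)   # pooled is [[5, 7], [9, 4]]
restored = max_unpool(pooled, idx)       # maxima returned to their positions
```

Each non-zero entry of `restored` sits where the maximum was found in the corresponding block of `x`, so applying max-pooling to `restored` gives `pooled` again.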


10.5.3 Fully convolutional networks 


The up-sampling methods considered above are fixed functions, much like the 
average-pooling and max-pooling down-sampling operations. We can also use a 
learned up-sampling that is analogous to strided convolution for down-sampling. In 
strided convolution, each unit on the output map is connected via shared learnable 
weights to a small patch on the input map, and as we move one step through the 
output array, the filter is moved two or more steps across the input array, and hence 
the output array has lower dimensionality than the input array. For up-sampling, we 
use a filter that connects one pixel in the input array to a patch in the output array, 
and then choose the architecture so that as we move one step across the input array, we 
move two or more steps across the output array (Dumoulin and Visin, 2016). This is 
illustrated for 3 x 3 filters, and an output stride of 2, in Figure 10.30. Note that there 
are output cells for which multiple filter positions overlap, and the corresponding 
output values can be found either by summing or by averaging the contributions 
from the individual filter positions. 

This up-sampling is called transpose convolution because, if the down-sampling 
convolution is expressed in matrix form, the corresponding up-sampling is given by 
the transpose matrix. It is also called ‘fractionally strided convolution’ because the 
stride of a standard convolution is the ratio of the step size in the input layer to the 
step size in the output layer. In Figure 10.30, for example, this ratio is 1/2. Note 
that this is sometimes also referred to as ‘deconvolution’, but it is better to avoid this 
term since deconvolution is widely used in mathematics to mean the inverse of the 
operation of convolution used in functional analysis, which is a different concept. If 
we have a network architecture with no pooling layers, so that the down-sampling 
and up-sampling are done purely using convolutions, then the architecture is known 
as a fully convolutional network (Long, Shelhamer, and Darrell, 2014). It can take 
an arbitrarily sized image and will output a segmentation map of the same size. 
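
The transpose relationship can be checked numerically. The sketch below, with randomly chosen values and illustrative function names, implements a strided convolution and the corresponding transpose convolution with summed overlaps, and verifies that the inner product of Ax with z equals that of x with A^T z, where A denotes the matrix of the down-sampling convolution. Padding and bias terms are omitted.

```python
import numpy as np


def strided_conv2d(x, w, stride=2):
    """'Valid' strided convolution (cross-correlation) used for down-sampling."""
    k = w.shape[0]
    oh = (x.shape[0] - k) // stride + 1
    ow = (x.shape[1] - k) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)
    return out


def transpose_conv2d(z, w, stride=2):
    """Each input unit adds a scaled copy of the filter to an output patch.

    Overlapping contributions are summed (rather than averaged).
    """
    k = w.shape[0]
    oh = (z.shape[0] - 1) * stride + k
    ow = (z.shape[1] - 1) * stride + k
    out = np.zeros((oh, ow))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += z[i, j] * w
    return out


rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))
z = rng.normal(size=(2, 2))
x = rng.normal(size=(5, 5))

up = transpose_conv2d(z, w)    # 2 x 2 input, 5 x 5 output
down = strided_conv2d(x, w)    # 5 x 5 input, 2 x 2 output

# Transpose (adjoint) property: <A x, z> equals <x, A^T z>.
print(np.allclose(np.sum(down * z), np.sum(x * up)))
```

The identity holds for any choice of x and z, which is what it means for the up-sampling operation to be the transpose of the down-sampling one when overlaps are summed.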

Figure 10.30 Illustration of transpose convolution for a 3 x 3 filter with an output
stride of 2, mapping a 2 x 2 input to a 5 x 5 output. This can be thought of as the
inverse operation to a 3 x 3 convolution. The red output patch is given by multiplying
the kernel by the activation of the red unit in the input layer, and similarly for the
blue output patch. The activations of cells for which patches overlap are calculated by
summing or averaging the contributions from the individual patches.



10.5.4 The U-net architecture 


We have seen that the down-sampling associated with strided convolutions and 
pooling allows the number of channels to be increased without the size of the net- 
work becoming prohibitive. This also has the effect of reducing the spatial resolution 
and hence discarding positional information as the signals flow through the network. 
Although this is fine for image classification, the loss of spatial information is a 
problem for semantic segmentation as we want to classify each pixel. One approach 
for addressing this is the U-net architecture (Ronneberger, Fischer, and Brox, 2015) 
illustrated in Figure 10.31, where the name comes from the U-shape of the diagram.
The core concept is that for each down-sampling layer there is a corresponding
up-sampling layer, and the final set of channel activations at each down-sampling layer
is concatenated with the corresponding first set of channels in the up-sampling layer,
thereby giving those layers access to higher-resolution spatial information. Note that
1 x 1 convolutions may be used in the final layer of a U-net to reduce the number
of channels down to the number of classes, which is then followed by a softmax
activation function.

Figure 10.31 The U-net architecture has a symmetrical arrangement of down-sampling and
up-sampling layers, and the output from each down-sampling layer is concatenated with
the corresponding up-sampling layer.

Figure 10.32 An example of neural style transfer showing a photograph of a canal scene
(left) that has been rendered in the style of The Wreck of a Transport Ship by
J. M. W. Turner (centre) and in the style of The Starry Night by Vincent van Gogh
(right). In each case the image used to provide the style is shown in the inset. [From
Gatys, Ecker, and Bethge (2015) with permission.]
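
The two operations that characterize the U-net, channel concatenation and a final 1 x 1 convolution followed by a softmax, can be sketched as below. The channel counts, spatial size, and random arrays are hypothetical, and the intermediate convolutional layers of a real U-net are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8
encoder_features = rng.normal(size=(16, H, W))  # saved from a down-sampling stage
decoder_features = rng.normal(size=(16, H, W))  # up-sampled to the same resolution

# Skip connection: concatenate along the channel axis.
merged = np.concatenate([encoder_features, decoder_features], axis=0)  # (32, H, W)

# A 1 x 1 convolution is a linear map across channels applied at every pixel.
num_classes = 3
w_1x1 = rng.normal(size=(num_classes, merged.shape[0]))
logits = np.einsum('kc,chw->khw', w_1x1, merged)   # (num_classes, H, W)

# Softmax over the channel axis gives class probabilities per pixel.
exp = np.exp(logits - logits.max(axis=0, keepdims=True))
probs = exp / exp.sum(axis=0, keepdims=True)
```

The concatenation doubles the number of channels while leaving the spatial resolution unchanged, and the 1 x 1 convolution then reduces the channels to one per class.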


10.6 Style Transfer


As we have seen, early layers in a deep convolutional network learn to detect simple 
features such as edges and textures whereas later layers learn to detect more complex 
entities such as objects. We can exploit this property to re-render an image in the 
style of a different image using a process called neural style transfer (Gatys, Ecker, 
and Bethge, 2015). This is illustrated in Figure 10.32. 

Our goal is to generate a synthetic image G whose ‘content’ is defined by an 
image C and whose ‘style’ is taken from some other image S. This is achieved 
by defining an error function E(G) given by the sum of two terms, one of which 
encourages G to have a similar content to C whereas the other encourages G to 
have a similar style to S: 


$$E(G) = E_{\mathrm{content}}(G, C) + E_{\mathrm{style}}(G, S). \tag{10.13}$$


The concepts of content and style are defined implicitly by the functional forms of 
these two terms. We can then find G by starting from a randomly initialized image 
and using gradient descent to minimize E(G). 

To define $E_{\mathrm{content}}(G, C)$, we can pick a particular convolutional layer in
the network and measure the activations of the units in that layer when image G is
used as input and also when image C is used as input. We can then encourage the
corresponding pre-activations to be similar by using a sum-of-squares error function
of the form

$$E_{\mathrm{content}}(G, C) = \sum_{i,j,k} \left\{ a_{ijk}(G) - a_{ijk}(C) \right\}^2 \tag{10.14}$$

where $a_{ijk}(G)$ denotes the pre-activation of the unit at position (i, j) in channel k
of that layer when the input image is G, and similarly for $a_{ijk}(C)$. The choice of
which layer to use in defining the pre-activations will influence the final result, with
earlier layers aiming to match low-level features like edges and later layers matching
more complex structures or even entire objects.

In defining $E_{\mathrm{style}}(G, S)$, the intuition is that style is determined by the
co-occurrence of features from different channels within a convolutional layer. For
example, if the style image S is such that vertical edges are generally associated
with orange blobs, then we would like the same to be true for the generated image
G. However, although $E_{\mathrm{content}}(G, C)$ tries to match features in G at the same
locations as corresponding features in C, for the style error $E_{\mathrm{style}}(G, S)$ we
want G to have characteristics that match those of S but taken from any location, and so
we take an average over locations in a feature map. Again, consider a particular
convolutional layer. We can measure the extent to which a feature in channel k
co-occurs with the corresponding feature in channel k' for input image G by forming
the cross-correlation matrix

$$F_{kk'}(G) = \sum_{i=1}^{I} \sum_{j=1}^{J} a_{ijk}(G)\, a_{ijk'}(G) \tag{10.15}$$

where I and J are the dimensions of the feature maps in this particular convolutional
layer, and the product $a_{ijk} a_{ijk'}$ will be large if both features are activated. If
there are K channels in this layer, then $F_{kk'}$ form the elements of a K x K matrix,
called the style matrix. We can measure the extent to which the two images G and S have
the same style by comparing their style matrices using

$$E_{\mathrm{style}}(G, S) = \frac{1}{(2IJK)^2} \sum_{k=1}^{K} \sum_{k'=1}^{K} \left\{ F_{kk'}(G) - F_{kk'}(S) \right\}^2. \tag{10.16}$$

Although we could again make use of a single layer, more pleasing results are
obtained by using contributions from multiple layers in the form

$$E_{\mathrm{style}}(G, S) = \sum_{l} \lambda_l E^{(l)}_{\mathrm{style}}(G, S) \tag{10.17}$$

where l denotes the convolutional layer. The coefficients $\lambda_l$ determine the relative
weighting between the different layers and also the weighting relative to the content
error term. These weighting coefficients are adjusted empirically using subjective
judgement.
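
The content and style error terms can be sketched as follows, assuming that the pre-activations of a chosen layer have already been extracted as arrays of shape (I, J, K). The function names are illustrative, the normalization of the style error by (2IJK)^2 is an assumption of this sketch, and the gradient-based optimization over the pixels of G is not shown.

```python
import numpy as np


def content_error(a_G, a_C):
    """Sum-of-squares difference between two (I, J, K) arrays of pre-activations."""
    return np.sum((a_G - a_C) ** 2)


def style_matrix(a):
    """K x K matrix with entries F[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    I, J, K = a.shape
    flat = a.reshape(I * J, K)   # one row per spatial location
    return flat.T @ flat


def style_error(a_G, a_S):
    """Squared difference between style matrices for one layer."""
    I, J, K = a_G.shape
    diff = style_matrix(a_G) - style_matrix(a_S)
    # Normalization by (2IJK)^2 is an assumption of this sketch.
    return np.sum(diff ** 2) / (2 * I * J * K) ** 2


def total_style_error(layers_G, layers_S, lambdas):
    """Weighted sum of per-layer style errors with coefficients lambda_l."""
    return sum(lam * style_error(g, s)
               for lam, g, s in zip(lambdas, layers_G, layers_S))


rng = np.random.default_rng(0)
a_G = rng.normal(size=(6, 6, 4))   # stand-ins for pre-activations of one layer
a_S = rng.normal(size=(6, 6, 4))
E_style = total_style_error([a_G], [a_S], lambdas=[1.0])
```

Because the style matrix sums over all spatial positions, it discards where features occur and retains only which channels tend to be active together.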

Exercises

10.1 (⋆) Consider a fixed weight vector $\mathbf{w}$ and show that the input vector $\mathbf{x}$ that maximizes
the scalar product $\mathbf{w}^{\mathrm{T}}\mathbf{x}$, subject to the constraint that $\|\mathbf{x}\|^2$ is constant, is given by
$\mathbf{x} = \alpha\mathbf{w}$ for some scalar $\alpha$. This can most easily be done using a Lagrange multiplier
(Appendix C).

10.2 (⋆⋆) Consider a convolutional network layer with a one-dimensional input array and
a one-dimensional feature map as shown in Figure 10.2, in which the input array
has dimensionality 5 and the filters have width 3 with a stride of 1. Show that this
can be expressed as a special case of a fully connected layer by writing down the
weight matrix in which missing connections are replaced by zeros and where shared
parameters are indicated by using replicated entries. Ignore any bias parameters.

10.3 (⋆) Explicitly calculate the output of the following convolution of a 4 x 4 input matrix
with a 2 x 2 filter:

(10.18)

10.4 (⋆⋆) If an image I has J x K pixels and a filter K has L x M elements, write
down the limits for the two summations in (10.2). In the mathematics literature, the
operation (10.2) would be called a cross-correlation, whereas a convolution would
be defined by

$$C(j, k) = \sum_{l} \sum_{m} I(j - l, k - m) K(l, m). \tag{10.19}$$

Write down the limits for the summations in (10.19). Show that (10.19) can be
written in the equivalent ‘flipped’ form

$$C(j, k) = \sum_{l} \sum_{m} I(j + l, k + m) K(l, m) \tag{10.20}$$

and again write down the limits for the summations.

10.5 (⋆) In mathematics, a convolution for a continuous variable x is defined by

$$F(x) = \int G(y)\, k(x - y)\, \mathrm{d}y \tag{10.21}$$

where k(x − y) is the kernel function. By considering a discrete approximation to
the integral, explain the relationship to a convolutional layer, defined by (10.19), in
a CNN.


10.6 (⋆) Consider an image of size J x K that is padded with an additional P pixels on
all sides and which is then convolved using a kernel of size M x M where M is an
odd number. Show that if we choose P = (M − 1)/2, then the resulting feature map
will have size J x K and hence will be the same size as the original image.

10.7 (⋆) Show that if a kernel of size M x M is convolved with an image of size J x K with
padding of depth P and strides of length S then the dimensionality of the resulting
feature map is given by (10.5).

10.8 (⋆⋆) For each of the 16 layers in the VGG-16 CNN shown in Figure 10.10, evaluate
(i) the number of weights (i.e., connections) including biases and (ii) the number
of independently learnable parameters. Confirm that the total number of learnable
parameters in the network is approximately 138 million.

10.9 (⋆⋆) Consider a convolution of the form (10.2) and suppose that the kernel is
separable so that

$$K(l, m) = F(l)\, G(m) \tag{10.22}$$

for some functions F(·) and G(·). Show that instead of performing a single
two-dimensional convolution it is now possible to compute the same answer using two
one-dimensional convolutions thereby resulting in a significant improvement in
efficiency.

10.10 (⋆) The DeepDream update procedure involves setting the δ variables for
backpropagation equal to the pre-activations of the nodes in the chosen layer and then
running backpropagation to the input layer to get a gradient vector over the pixels of
the image. Show that this can be derived as a gradient optimization with respect to the
pixels of an image I applied to the function (10.12).

10.11 (⋆⋆) When designing a neural network to detect objects from C different classes in
an image, we can use a 1-of-(C + 1) class label with one variable for each object class
and one additional variable representing a ‘background’ class, i.e., an input image
region that does not contain an object belonging to any of the defined classes. The
network will then output a vector of probabilities of length (C + 1). Alternatively,
we can use a single binary variable to denote the presence or absence of an object
and then use a separate 1-of-C vector to denote the specific object class. In this
case, the network outputs a single probability representing the presence of an object
and a separate set of probabilities over the class label. Write down the relationship
between these two sets of probabilities.

10.12 (⋆⋆) Calculate the number of computational steps required to make one forward
pass through the convolutional network shown in Figure 10.22, ignoring biases and
ignoring the evaluation of activation functions. Similarly, calculate the total number
of computational steps for a single forward pass through the expanded network
shown in Figure 10.23. Finally, evaluate the ratio of nine repeated naive applications of
the network in Figure 10.22 to an 8 x 8 image compared to a single application of
the network in Figure 10.23. This ratio indicates the improvement in efficiency from
using a convolutional implementation of the sliding window technique.

10.13 (⋆⋆) In this exercise we use one-dimensional vectors to demonstrate why a
convolutional up-sampling is sometimes called a transpose convolution. Consider a
one-dimensional strided convolutional layer with an input having four units with
activations $(x_1, x_2, x_3, x_4)$, which is padded with zeros to give $(0, x_1, x_2, x_3, x_4, 0)$,
and a filter with parameters $(w_1, w_2, w_3)$. Write down the one-dimensional activation
vector of the output layer assuming a stride of 2. Express this output in the
form of a matrix $\mathbf{A}$ multiplied by the vector $(0, x_1, x_2, x_3, x_4, 0)$. Now consider
an up-sampling convolution in which the input layer has activations $(z_1, z_2)$ with
a filter having values $(w_1, w_2, w_3)$ and an output stride of 2. Write down the
six-dimensional output vector assuming that overlapping filter values are summed and
that the activation function is just the identity. Show that this can be expressed as a
matrix multiplication using the transpose matrix $\mathbf{A}^{\mathrm{T}}$.




11 Structured Distributions

We have seen that probability forms one of the most important foundational concepts 
for deep learning. For example, a neural network used for binary classification is 
described by a conditional probability distribution of the form 


$$p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \left\{ 1 - y(\mathbf{x}, \mathbf{w}) \right\}^{(1-t)} \tag{11.1}$$


where y(x, w) represents a neural network function that takes a vector x as input 
and is governed by a vector w of learnable parameters. The corresponding cross- 
entropy likelihood forms the basis for defining an error function used to train the 
neural network. Although the network function might be extremely complex, the
conditional distribution in (11.1) has a simple form. However, there are many important
deep learning models that have a much richer probabilistic structure, such as
large language models, normalizing flows, variational autoencoders, diffusion models,
and many others. To describe and exploit this structure, we introduce a powerful
framework called probabilistic graphical models, or simply graphical models, which
allows structured probability distributions to be expressed in graphical form. When
combined with neural networks to define associated probability distributions, graphical
models offer huge flexibility when creating sophisticated models that can be
trained end to end using stochastic gradient descent in which gradients are evaluated
efficiently using auto-differentiation. In this chapter, we will focus on the core concepts
of graphical models needed for applications in deep learning, whereas a more
comprehensive treatment of graphical models for machine learning can be found in
Bishop (2006).


11.1 Graphical Models


Probability theory can be expressed in terms of two simple equations known as the 
sum rule and the product rule. All of the probabilistic manipulations discussed in 
this book, no matter how complex, amount to repeated application of these two 
equations. In principle, we could therefore formulate and use complex probabilistic 
models purely by using algebraic manipulation. However, we will find it advan- 
tageous to augment the analysis using diagrammatic representations of probability 
distributions, as these offer several useful properties: 


1. They provide a simple way to visualize the structure of a probabilistic model 
and can be used to design and motivate new models. 


2. Insights into the properties of the model, including conditional independence 
properties, can be obtained by inspecting the graph. 


3. The complex computations required to perform inference and learning in so- 
phisticated models can be expressed in terms of graphical operations, such as 
message-passing, in which the underlying mathematical operations are carried 
out implicitly. 


Although such graphical models have nodes and edges much like neural network 
diagrams, their interpretation is specifically probabilistic and carries a richer seman- 
tics. To help avoid confusion, in this book we denote neural network diagrams in 
blue and probabilistic graphical models in red. 


11.1.1 Directed graphs 


A graph comprises nodes, also called vertices, connected by links, also known 
as edges. In a probabilistic graphical model, each node represents a random variable, 
and the links express probabilistic relationships between these variables. The graph 
then captures the way in which the joint distribution over all the random variables 
can be decomposed into a product of factors each depending only on a subset of the 
variables. In this chapter we will focus on graphical models in which the links of the 
graphs have a particular direction indicated by arrows. These are known as directed 
graphical models and are also called Bayesian networks or Bayes nets. 
Figure 11.1 A directed graphical model representing the joint probability distribution over three variables a, b, and c, corresponding to the decomposition on the right-hand side of (11.3).


The other major class of graphical models comprises Markov random fields, also known
as undirected graphical models, in which the links do not carry arrows and have no 
directional significance. Directed graphs are useful for expressing causal relation- 
ships between random variables, whereas undirected graphs are better suited to ex- 
pressing soft constraints between random variables. Both directed and undirected 
graphs can be viewed as special cases of a representation called factor graphs. From 
now on we focus our attention on directed graphical models. Note, however, that 
undirected graphs, without the probabilistic interpretation, will also arise in our dis- 
cussion of graph neural networks in which the nodes represent deterministic vari- 
ables as in standard neural networks. 


11.1.2 Factorization 


To motivate the use of directed graphs to describe probability distributions, con- 
sider first an arbitrary joint distribution p(a, b,c) over three variables a, b, and c. 
Note that at this stage, we do not need to specify anything further about these vari- 
ables, such as whether they are discrete or continuous. Indeed, one of the powerful 
aspects of graphical models is that a specific graph can make probabilistic statements 
for a broad class of distributions. By application of the product rule of probability 
(2.9), we can write the joint distribution in the form 


p(a, b, c) = p(c | a, b) p(a, b).    (11.2)


A second application of the product rule, this time to the second term on the right- 
hand side of (11.2), gives 


p(a, b, c) = p(c | a, b) p(b | a) p(a).    (11.3)


Note that this decomposition holds for any choice of the joint distribution. We now 
represent the right-hand side of (11.3) in terms of a simple graphical model as fol- 
lows. First we introduce a node for each of the random variables a, b, and c and as- 
sociate each node with the corresponding conditional distribution on the right-hand 
side of (11.3). Then, for each conditional distribution we add directed links (depicted 
as arrows) from the nodes corresponding to the variables on which the distribution is 
conditioned. Thus, for the factor p(c|a, b), there will be links from nodes a and b to 
node c, whereas for the factor p(a), there will be no incoming links. The result is the 
graph shown in Figure 11.1. If there is a link going from node a to node b, then we 
say that node a is the parent of node b, and we say that node b is the child of node a. 
Note that we will not make any formal distinction between a node and the variable 
to which it corresponds but will simply use the same symbol to refer to both. 
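
The claim that the decomposition (11.3) holds for any joint distribution can be checked numerically. The following is a minimal sketch, assuming three binary variables and a randomly generated joint table; the helper names (`p_a`, `p_ab`, `p_b_given_a`, `p_c_given_ab`) are illustrative and do not come from the text.

```python
import itertools
import random

random.seed(0)

# An arbitrary joint distribution p(a, b, c) over three binary variables,
# stored as a dict keyed by (a, b, c). The values are random and only
# normalized; nothing else is assumed about them.
states = [0, 1]
raw = {abc: random.random() for abc in itertools.product(states, repeat=3)}
total = sum(raw.values())
joint = {abc: value / total for abc, value in raw.items()}


def p_a(a):
    """Marginal p(a), obtained with the sum rule."""
    return sum(joint[(a, b, c)] for b in states for c in states)


def p_ab(a, b):
    """Marginal p(a, b), obtained with the sum rule."""
    return sum(joint[(a, b, c)] for c in states)


def p_b_given_a(b, a):
    """Conditional p(b | a), obtained with the product rule."""
    return p_ab(a, b) / p_a(a)


def p_c_given_ab(c, a, b):
    """Conditional p(c | a, b), obtained with the product rule."""
    return joint[(a, b, c)] / p_ab(a, b)


# Largest discrepancy between p(a, b, c) and p(c|a, b) p(b|a) p(a).
max_error = max(
    abs(joint[(a, b, c)] - p_c_given_ab(c, a, b) * p_b_given_a(b, a) * p_a(a))
    for a, b, c in itertools.product(states, repeat=3)
)
print(f"max |p(a,b,c) - p(c|a,b) p(b|a) p(a)| = {max_error:.2e}")
```

The discrepancy is zero up to floating-point rounding whatever random table is drawn, which is the sense in which the fully connected graph places no restriction on the distribution.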
Figure 11.2 Example of a directed graph describing the joint distribution over variables x_1, ..., x_7. The corresponding decomposition of the joint distribution is given by (11.5).


An important point to note about (11.3) is that the left-hand side is symmetrical 
with respect to the three variables a, b, and c, whereas the right-hand side is not. In 
making the decomposition in (11.3), we have implicitly chosen a particular ordering, 
namely a,b,c, and had we chosen a different ordering we would have obtained a 
different decomposition and hence a different graphical representation. 

For the moment let us extend the example of Figure 11.1 by considering the 
joint distribution over K variables given by p(x_1, ..., x_K). By repeated application
of the product rule of probability, this joint distribution can be written as a product
of conditional distributions, one for each of the variables:


p(x_1, ..., x_K) = p(x_K | x_1, ..., x_{K-1}) \cdots p(x_2 | x_1) p(x_1).    (11.4)


For a given choice of K, we can again represent this as a directed graph having K 
nodes, one for each conditional distribution on the right-hand side of (11.4), with 
each node having incoming links from all lower numbered nodes. We say that this 
graph is fully connected because there is a link between every pair of nodes. 

So far, we have worked with completely general joint distributions, and so their 
factorization, and associated representation as fully connected graphs, will be appli- 
cable to any choice of distribution. As we will see shortly, it is the absence of links 
in the graph that conveys interesting information about the properties of the class 
of distributions that the graph represents. Consider the graph shown in Figure 11.2. 
Note that it is not a fully connected graph because, for instance, there is no link from
x_1 to x_2 or from x_3 to x_7. We take this graph and extract the corresponding representation
of the joint probability distribution written in terms of the product of a set
of conditional distributions, one for each node in the graph. Each such conditional
distribution will be conditioned only on the parents of the corresponding node in the
graph. For instance, x_5 will be conditioned on x_1 and x_3. The joint distribution of
all seven variables is therefore given by


p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5).    (11.5)


The reader should take a moment to study carefully the correspondence between 
(11.5) and Figure 11.2. 

We can now state in general terms the relationship between a given directed
graph and the corresponding distribution over the variables. The joint distribution
defined by a graph is given by the product, over all of the nodes of the graph, of
a conditional distribution for each node conditioned on the variables corresponding
to the parents of that node in the graph. Thus, for a graph with K nodes, the joint
distribution is given by


p(x_1, ..., x_K) = \prod_{k=1}^{K} p(x_k | pa(k))    (11.6)


where pa(k) denotes the set of parents of x_k. This key equation expresses the factorization
properties of the joint distribution for a directed graphical model. Although
we have considered each node to correspond to a single variable, we can equally
well associate sets of variables and vector-valued or tensor-valued variables with the
nodes of a graph. It is easy to show that the representation on the right-hand side of
(11.6) is always correctly normalized provided the individual conditional distributions
are normalized.


Exercise 11.1


Exercise 11.2
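
The factorization (11.6) and its normalization property can be illustrated with a short sketch. It assumes binary variables and randomly generated conditional probability tables, with the parent sets read off from (11.5); the names `parents`, `random_cpt`, `cpts`, and `joint` are hypothetical helpers introduced only for this example.

```python
import itertools
import random

random.seed(1)

# Parent sets read off from the factorization (11.5); nodes are numbered 1..7.
parents = {1: (), 2: (), 3: (), 4: (1, 2, 3), 5: (1, 3), 6: (4,), 7: (4, 5)}


def random_cpt(n_parents):
    """Random table giving p(x_k = 1 | parent states) for binary variables."""
    return {
        pa_states: random.random()
        for pa_states in itertools.product([0, 1], repeat=n_parents)
    }


cpts = {k: random_cpt(len(pa)) for k, pa in parents.items()}


def joint(x):
    """Evaluate (11.6) for an assignment x, a dict mapping node -> 0 or 1."""
    prob = 1.0
    for k, pa in parents.items():
        p_one = cpts[k][tuple(x[j] for j in pa)]
        prob *= p_one if x[k] == 1 else 1.0 - p_one
    return prob


# Sum the joint over all 2^7 assignments; the result should be 1 because
# every individual conditional distribution is normalized.
nodes = sorted(parents)
total = sum(
    joint(dict(zip(nodes, values)))
    for values in itertools.product([0, 1], repeat=len(nodes))
)
print(f"sum over all states = {total:.12f}")
```

Changing the entries of `parents` (while keeping the graph acyclic) changes the family of distributions represented, but the sum over all states remains 1.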

The directed graphs that we are considering are subject to an important restric- 
tion, namely that there must be no directed cycles. In other words, there are no closed 
paths within the graph such that we can move from node to node along links follow- 
ing the direction of the arrows and end up back at the starting node. Such graphs are 
also called directed acyclic graphs, or DAGs. This is equivalent to the statement that 
there exists an ordering of the nodes such that there are no links that go from any 
node to any lower-numbered node. 
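
The equivalence between the absence of directed cycles and the existence of a suitable node ordering suggests a simple test. The sketch below uses Kahn's algorithm, a standard topological-sorting procedure that is not named in the text; the function name `topological_order` and the two example graphs are illustrative assumptions.

```python
from collections import deque


def topological_order(parents):
    """Return a node ordering in which every parent precedes its children,
    or None if the graph contains a directed cycle (Kahn's algorithm).

    `parents` maps each node to an iterable of its parent nodes; every
    parent is assumed to appear as a key of the mapping as well.
    """
    children = {k: [] for k in parents}
    n_unplaced_parents = {k: 0 for k in parents}
    for k, pa in parents.items():
        for j in pa:
            children[j].append(k)
            n_unplaced_parents[k] += 1

    # Nodes with no parents can be placed first.
    ready = deque(k for k, n in n_unplaced_parents.items() if n == 0)
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        for k in children[j]:
            n_unplaced_parents[k] -= 1
            if n_unplaced_parents[k] == 0:
                ready.append(k)

    # If some nodes were never placed, they lie on a directed cycle.
    return order if len(order) == len(parents) else None


dag = {"a": [], "b": ["a"], "c": ["a", "b"]}   # the graph of (11.3)
cyclic = {"a": ["c"], "b": ["a"], "c": ["b"]}  # a -> b -> c -> a

dag_order = topological_order(dag)
cyclic_order = topological_order(cyclic)
print(dag_order, cyclic_order)
```

For the graph of (11.3) the function recovers the ordering a, b, c used to derive the decomposition, whereas the cyclic example admits no ordering.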


11.1.3 Discrete variables 


We have discussed the importance of probability distributions that are mem- 
bers of the exponential family, and we have seen that this family includes many 
well-known distributions as special cases. Although such distributions are relatively 
simple, they form useful building blocks for constructing more complex probability 
distributions, and the framework of graphical models is very useful in expressing 
the way in which these building blocks are linked together. There are two particu- 
lar choices for the component distributions that are widely used, corresponding to 
discrete variables and to Gaussian variables. We begin by examining the discrete 
case. 

The probability distribution p(x | \mu) for a single discrete variable x having K
possible states (using the 1-of-K representation) is given by


p(x | \mu) = \prod_{k=1}^{K} \mu_k^{x_k}    (11.7)


and is governed by the parameters \mu = (\mu_1, ..., \mu_K)^T. Due to the constraint
\sum_k \mu_k = 1, only K - 1 values for \mu_k need to be specified to define the distribution.

Now suppose that we have two discrete variables, x_1 and x_2, each of which has
K states, and we wish to model their joint distribution. We denote the probability of
observing both x_{1k} = 1 and x_{2l} = 1 by the parameter \mu_{kl}, where x_{1k} denotes the
kth component of x_1, and similarly for x_{2l}. The joint distribution can be written


p(x_1, x_2 | \mu) = \prod_{k=1}^{K} \prod_{l=1}^{K} \mu_{kl}^{x_{1k} x_{2l}}.


Because the parameters \mu_{kl} are subject to the constraint \sum_k \sum_l \mu_{kl} = 1, this distribution
is governed by K^2 - 1 parameters. It is easily seen that the total number of
parameters that must be specified for an arbitrary joint distribution over M variables
is K^M - 1 and therefore grows exponentially with the number M of variables.


Figure 11.3 (a) This fully connected graph describes a general distribution over two K-state discrete variables having a total of K^2 - 1 parameters. (b) By dropping the link between the nodes, the number of parameters is reduced to 2(K - 1).

Using the product rule, we can factor the joint distribution p(x_1, x_2) in the form
p(x_2 | x_1) p(x_1), which corresponds to a two-node graph with a link going from the
x_1 node to the x_2 node as shown in Figure 11.3(a). The marginal distribution p(x_1)
is governed by K - 1 parameters, as before. Similarly, the conditional distribution
p(x_2 | x_1) requires the specification of K - 1 parameters for each of the K possible
values of x_1. The total number of parameters that must be specified in the joint
distribution is therefore (K - 1) + K(K - 1) = K^2 - 1 as before.

Now suppose that the variables x_1 and x_2 are independent, corresponding to
the graphical model shown in Figure 11.3(b). Each variable is then described by a
separate discrete distribution, and the total number of parameters would be 2(K - 1).
For a distribution over M independent discrete variables, each having K states, the
total number of parameters would be M(K - 1), which therefore grows linearly
with the number of variables. From a graphical perspective, we have reduced the
number of parameters by dropping links in the graph, at the expense of having a
more restricted class of distributions.

More generally, if we have M discrete variables x_1, ..., x_M, we can model the
joint distribution using a directed graph with one variable for each node. The conditional
distribution at each node is given by a set of non-negative parameters subject
to the usual normalization constraint. If the graph is fully connected, then we have a
completely general distribution having K^M - 1 parameters, whereas if there are no
links in the graph, the joint distribution factorizes into the product of the marginal
distributions, and the total number of parameters is M(K - 1). Graphs having intermediate
levels of connectivity allow for more general distributions than the fully
factorized one while requiring fewer parameters than the general joint distribution.
As an illustration, consider the chain of nodes shown in Figure 11.4. The marginal
distribution p(x_1) requires K - 1 parameters, whereas each of the M - 1 conditional
distributions p(x_i | x_{i-1}), for i = 2, ..., M, requires K(K - 1) parameters.
This gives a total parameter count of K - 1 + (M - 1)K(K - 1), which is quadratic
in K and which grows linearly (rather than exponentially) with the length M of the
chain.


Figure 11.4 This chain of M discrete nodes, each having K states, requires the specification of K - 1 + (M - 1)K(K - 1) parameters, which grows linearly with the length M of the chain. In contrast, a fully connected graph of M nodes would have K^M - 1 parameters, which grows exponentially with M.

An alternative way to reduce the number of independent parameters in a model
is by sharing parameters (also known as tying of parameters). For instance, in the
chain example of Figure 11.4, we can arrange that all the conditional distributions
p(x_i | x_{i-1}), for i = 2, ..., M, are governed by the same set of K(K - 1) parameters,
giving the model shown in Figure 11.5. Together with the K - 1 parameters
governing the distribution of x_1, this gives a total of K^2 - 1 parameters that must be
specified to define the joint distribution.
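
The parameter counts quoted above for the different graph structures can be compared side by side. This is a small sketch that simply evaluates the formulas from the text; the function names and the example values K = 10, M = 5 are assumptions chosen for illustration.

```python
def n_params_full(K, M):
    """Fully connected graph: general joint over M variables with K states."""
    return K**M - 1


def n_params_independent(K, M):
    """No links: product of M separate marginal distributions."""
    return M * (K - 1)


def n_params_chain(K, M):
    """Chain x_1 -> x_2 -> ... -> x_M with a separate table for each link."""
    return (K - 1) + (M - 1) * K * (K - 1)


def n_params_tied_chain(K):
    """Chain whose conditionals all share one table; independent of M."""
    return (K - 1) + K * (K - 1)  # equals K**2 - 1


K, M = 10, 5
counts = {
    "fully connected": n_params_full(K, M),
    "independent": n_params_independent(K, M),
    "chain": n_params_chain(K, M),
    "tied chain": n_params_tied_chain(K),
}
for name, n in counts.items():
    print(f"{name:>15}: {n}")
```

With these example values the fully connected graph needs 99999 parameters, the chain 369, the tied chain 99, and the fully factorized model 45.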

Another way of controlling the exponential growth of the number of parameters
in models of discrete variables is to use parameterized representations for the conditional
distributions instead of complete tables of conditional probability values. To
illustrate this idea, consider the graph in Figure 11.6 in which all the nodes represent
binary variables. Each of the parent variables x_i is governed by a single parameter
\mu_i representing the probability p(x_i = 1), giving M parameters in total for the
parent nodes. The conditional distribution p(y | x_1, ..., x_M), however, would require
2^M parameters representing the probability p(y = 1) for each of the 2^M possible
settings of the parent variables. Thus, in general the number of parameters required
to specify this conditional distribution will grow exponentially with M. We can obtain
a more parsimonious form for the conditional distribution by using a logistic
sigmoid function acting on a linear combination of the parent variables, giving


p(y = 1 | x_1, ..., x_M) = \sigma( w_0 + \sum_{i=1}^{M} w_i x_i ) = \sigma(w^T x)    (11.8)


where \sigma(a) = (1 + \exp(-a))^{-1} is the logistic sigmoid, x = (x_0, x_1, ..., x_M)^T is an
(M + 1)-dimensional vector of parent states augmented with an additional variable
x_0 whose value is clamped to 1, and w = (w_0, w_1, ..., w_M)^T is a vector of M + 1
parameters. This is a more restricted form of conditional distribution than the general
case but is now governed by a number of parameters that grows linearly with M. In
this sense, it is analogous to the choice of a restrictive form of covariance matrix (for
example, a diagonal matrix) in a multivariate Gaussian distribution.


Exercise 11.6


Figure 11.5 As in Figure 11.4 but with a single set of parameters \mu shared amongst all the conditional distributions p(x_i | x_{i-1}).


Figure 11.6 A graph comprising M parents x_1, ..., x_M and a single child y, used to illustrate the idea of parameterized conditional distributions for discrete variables.
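
The saving offered by the parameterized conditional distribution (11.8) can be made concrete with a short sketch. It assumes M = 4 binary parents and an arbitrary, made-up weight vector `w`; the function names are illustrative only.

```python
import itertools
import math


def sigmoid(a):
    """Logistic sigmoid, (1 + exp(-a))^{-1}."""
    return 1.0 / (1.0 + math.exp(-a))


def p_y1_given_parents(x, w):
    """Evaluate (11.8): x holds the M binary parent states and w holds the
    M + 1 parameters (w_0, w_1, ..., w_M), with w_0 multiplying x_0 = 1."""
    a = w[0] + sum(w_i * x_i for w_i, x_i in zip(w[1:], x))
    return sigmoid(a)


M = 4
w = [-1.0, 0.5, 2.0, -0.3, 1.2]  # hypothetical parameter values

# The M + 1 parameters determine p(y = 1 | x) for all 2^M parent settings,
# whereas a complete table would need one free parameter per setting.
table = {
    x: p_y1_given_parents(x, w)
    for x in itertools.product([0, 1], repeat=M)
}
print(f"{len(w)} parameters determine all {len(table)} table entries")
```

The price of this economy is that only those tables expressible as a sigmoid of a linear function of the parents can be represented.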


11.1.4 Gaussian variables 


We now turn to graphical models in which the nodes represent continuous vari- 
ables having Gaussian distributions. Each distribution is conditioned on the state 
of its parents in the graph. That dependence could take many forms, and here we 
focus on a specific choice in which the mean of each Gaussian is some linear func- 
tion of the states of the Gaussian parent variables. This leads to a class of models 
called linear-Gaussian models, which include many cases of practical interest such 
as probabilistic principal component analysis, factor analysis, and linear dynamical 
systems (Roweis and Ghahramani, 1999). 

Consider an arbitrary directed acyclic graph over D variables in which node i
represents a single continuous random variable x_i having a Gaussian distribution.
The mean of this distribution is taken to be a linear combination of the states of the
parent nodes pa(i) of node i:


p(x_i | pa(i)) = \mathcal{N}( x_i | \sum_{j \in pa(i)} w_{ij} x_j + b_i, v_i )    (11.9)


where w_{ij} and b_i are parameters governing the mean and v_i is the variance of the
conditional distribution for x_i. The log of the joint distribution is then the log of the
product of these conditionals over all nodes in the graph and hence takes the form


\ln p(x) = \sum_{i=1}^{D} \ln p(x_i | pa(i))    (11.10)

         = -\sum_{i=1}^{D} \frac{1}{2 v_i} ( x_i - \sum_{j \in pa(i)} w_{ij} x_j - b_i )^2 + const    (11.11)


where x = (x_1, ..., x_D)^T and ‘const’ denotes terms independent of x. We see that
this is a quadratic function of the components of x, and hence the joint distribution
p(x) is a multivariate Gaussian.

We can find the mean and covariance of this joint distribution as follows. The
mean of each variable is given by the recursion relation


E[x_i] = \sum_{j \in pa(i)} w_{ij} E[x_j] + b_i    (11.12)


and so we can find the components of E[x] = (E[x_1], ..., E[x_D])^T by starting at
the lowest numbered node and working recursively through the graph, where we
assume that the nodes are numbered such that each node has a higher number than
its parents. Similarly, the elements of the covariance matrix of the joint distribution
satisfy a recursion relation of the form


cov[x_i, x_j] = \sum_{k \in pa(j)} w_{jk} cov[x_i, x_k] + I_{ij} v_j    (11.13)


and so the covariance can similarly be evaluated recursively starting from the lowest
numbered node.


Figure 11.7 A directed graph over three Gaussian variables with one missing link.

We now consider two extreme cases of possible graph structures. First, suppose
that there are no links in the graph, which therefore comprises D isolated nodes.
In this case, there are no parameters w_{ij} and so there are just D parameters b_i and
D parameters v_i. From the recursion relations (11.12) and (11.13), we see that the
mean of p(x) is given by (b_1, ..., b_D)^T and the covariance matrix is diagonal of
the form diag(v_1, ..., v_D). The joint distribution has a total of 2D parameters and
represents a set of D independent univariate Gaussian distributions.

Now consider a fully connected graph in which each node has all lower numbered
nodes as parents. In this case the total number of independent parameters
{w_{ij}} and {v_i} in the covariance matrix is D(D + 1)/2 corresponding to a general
symmetric covariance.

Graphs having some intermediate level of complexity correspond to joint Gaussian
distributions with partially constrained covariance matrices. Consider for example
the graph shown in Figure 11.7, which has a link missing between variables x_1
and x_3. Using the recursion relations (11.12) and (11.13), we see that the mean and
covariance of the joint distribution are given by


\mu = (b_1, b_2 + w_{21} b_1, b_3 + w_{32} b_2 + w_{32} w_{21} b_1)^T    (11.14)

\Sigma = \begin{pmatrix}
v_1 & w_{21} v_1 & w_{32} w_{21} v_1 \\
w_{21} v_1 & v_2 + w_{21}^2 v_1 & w_{32} (v_2 + w_{21}^2 v_1) \\
w_{32} w_{21} v_1 & w_{32} (v_2 + w_{21}^2 v_1) & v_3 + w_{32}^2 (v_2 + w_{21}^2 v_1)
\end{pmatrix}.    (11.15)

We can readily extend the linear-Gaussian graphical model to a situation in
which the nodes of the graph represent multivariate Gaussian variables. In this case,
we can write the conditional distribution for node i in the form


p(x_i | pa(i)) = \mathcal{N}( x_i | \sum_{j \in pa(i)} W_{ij} x_j + b_i, \Sigma_i )    (11.16)


where now W_{ij} is a matrix (which is non-square if x_i and x_j have different dimensionality).
Again it is easy to verify that the joint distribution over all variables is
Gaussian.
Figure 11.8 Directed graphical model representing the binary classifier model described by the joint distribution (11.17) showing only the stochastic variables {t_1, ..., t_N} and w.


11.1.5 Binary classifier 


We can illustrate the use of directed graphs to describe probability distributions
using a two-class classifier model with Gaussian prior over the learnable parameters.
We can write this in the form


p(t, w | X, \lambda) = p(w | \lambda) \prod_{n=1}^{N} p(t_n | w, x_n)    (11.17)


where t = (t_1, ..., t_N)^T is the vector of target values, X is the data matrix with
rows x_1^T, ..., x_N^T, and the distribution p(t | x, w) is given by (11.1). We also assume a
Gaussian prior over the parameter vector w given by


p(w | \lambda) = \mathcal{N}(w | 0, \lambda I).    (11.18)


The stochastic variables in this model are {t_1, ..., t_N} and w. In addition, this
model contains the noise variance \sigma^2 and the hyperparameter \lambda, both of which are
parameters of the model rather than stochastic variables. If we consider for a moment
only the stochastic variables, then the distribution given by (11.17) can be
represented by the graphical model shown in Figure 11.8.

When we start to deal with more complex models, it becomes inconvenient to
have to write out multiple nodes of the form t_1, ..., t_N explicitly as in Figure 11.8.
We therefore introduce a graphical notation that allows such multiple nodes to be expressed
more compactly. We draw a single representative node t_n and then surround
this with a box, called a plate, labelled with N to indicate that there are N nodes of
this kind. Rewriting the graph of Figure 11.8 in this way, we obtain the graph shown
in Figure 11.9.


11.1.6 Parameters and observations 


We will sometimes find it helpful to make the parameters of a model, as well
as its stochastic variables, explicit in the graphical representation. To do this, we
will adopt the convention that random variables are denoted by open circles and
deterministic parameters are denoted by floating variables. If we take the graph of
Figure 11.9 and include the deterministic parameters, we obtain the graph shown in
Figure 11.10.


Figure 11.9 An alternative, more compact, representation of the graph shown in Figure 11.8 in which we have introduced a plate (the box labelled N) that represents N nodes of which only a single example t_n is shown explicitly.


Figure 11.10 The same model as in Figure 11.9 but with the deterministic parameters shown explicitly by the floating variables.

When we apply a graphical model to a problem in machine learning, we will
typically set some of the random variables to specific observed values. For example,
the stochastic variables {t_n} in the linear regression model will be set equal to the
specific values given in the training set. In a graphical model, we denote such observed
variables by shading the corresponding nodes. Thus, the graph corresponding
to Figure 11.10 in which the variables {t_n} are observed is shown in Figure 11.11.

Note that the value of w is not observed, and so w is an example of a latent 
variable, also known as a hidden variable. Such variables play a crucial role in many 
of the models discussed in this book. We therefore have three kinds of variables in a 
directed graphical model. First, there are unobserved (also called latent, or hidden) 
stochastic variables, which are denoted by open red circles. Second, when stochastic 
variables are observed, so that they are set to specific values, they are denoted
by red circles shaded with blue. Finally, non-stochastic parameters are denoted by 
floating variables, as seen in Figure 11.11. 

Note that model parameters such as w are generally of little direct interest in
themselves, because our ultimate goal is to make predictions for new input values.
Suppose we are given a new input value \hat{x} and we wish to find the corresponding
probability distribution for \hat{t} conditioned on the observed data. The joint distribution
of all the random variables in this model, conditioned on the deterministic parameters,
is given by


p(\hat{t}, t, w | \hat{x}, X, \lambda) = p(w | \lambda) p(\hat{t} | w, \hat{x}) \prod_{n=1}^{N} p(t_n | w, x_n)    (11.19)


and the corresponding graphical model is shown in Figure 11.12.


Figure 11.11 As in Figure 11.10 but with the nodes {t_n} shaded to indicate that the corresponding random variables have been set to their observed values given by the training set.


Figure 11.12 The classification model, corresponding to Figure 11.11, showing a new input value \hat{x} together with the corresponding model prediction \hat{t}.

The required predictive distribution for \hat{t} is then obtained from the sum rule
of probability by integrating out the model parameters w. This integration over
parameters represents a fully Bayesian treatment, which is rarely used in practice,
especially with deep neural networks. Instead, we approximate this integral by first
finding the most probable value w_MAP that maximizes the posterior distribution and
then using just this single value to make predictions using p(\hat{t} | w_MAP, \hat{x}).


11.1.7 Bayes’ theorem 


When stochastic variables in a probabilistic model are set equal to observed val- 
ues, the distributions over other unobserved stochastic variables change accordingly. 
The process of calculating these updated distributions is known as inference. We can 
illustrate this by considering the graphical interpretation of Bayes’ theorem. Suppose 
we decompose the joint distribution p(x, y) over two variables x and y into a product 
of factors in the form p(x, y) = p(x)p(y|x). This can be represented by the directed 
graph shown in Figure 11.13(a). Now suppose we observe the value of y, as indi- 
cated by the shaded node in Figure 11.13(b). We can view the marginal distribution 
p(x) as a prior over the latent variable x, and our goal is to infer the corresponding 
posterior. Using the sum and product rules of probability we can evaluate 


p(y) = \sum_{x'} p(y | x') p(x'),    (11.20)


which can then be used in Bayes’ theorem to calculate


p(x | y) = \frac{p(y | x) p(x)}{p(y)}.    (11.21)


Thus, the joint distribution is now expressed in terms of p(x | y) and p(y). From
a graphical perspective, the joint distribution p(x, y) is represented by the graph 
shown in Figure 11.13(c), in which the direction of the arrow is reversed. This is the 
simplest example of an inference problem for a graphical model. 
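
The two steps (11.20) and (11.21) can be traced through with concrete numbers. The following sketch assumes a latent variable x with three states and a binary observed variable y; the probability values and the names `prior`, `likelihood`, `evidence`, and `posterior` are invented for illustration.

```python
# Prior p(x) over a latent variable with three states, and likelihood
# p(y | x) for a binary y. The numbers are made up for illustration.
prior = {0: 0.5, 1: 0.3, 2: 0.2}
likelihood = {  # likelihood[x][y] = p(y | x)
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.4, 1: 0.6},
    2: {0: 0.2, 1: 0.8},
}


def evidence(y):
    """p(y) from (11.20): sum over x' of p(y | x') p(x')."""
    return sum(likelihood[x][y] * prior[x] for x in prior)


def posterior(y):
    """p(x | y) from (11.21), returned as a dict over the states of x."""
    p_y = evidence(y)
    return {x: likelihood[x][y] * prior[x] / p_y for x in prior}


y_obs = 1
post = posterior(y_obs)
print(f"p(y={y_obs}) = {evidence(y_obs):.4f}")
for x, p in post.items():
    print(f"p(x={x} | y={y_obs}) = {p:.4f}")
```

Multiplying each posterior value by p(y) recovers p(y | x) p(x), which is the statement that the two factorizations in Figure 11.13(a) and (c) describe the same joint distribution.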

For complex graphical models that capture rich probabilistic structure, the process
of calculating posterior distributions once some of the stochastic variables are
observed can be complex and subtle. Conceptually, it simply involves the systematic
application of the sum and product rules of probability, or equivalently Bayes’ theorem.
In practice, however, managing these calculations efficiently can benefit greatly
from an exploitation of the graphical structure. These calculations can be expressed
in terms of elegant calculations on the graph that involve sending local messages between
nodes. Such methods give exact answers for tree-structured graphs and give
approximate iterative algorithms for graphs with loops. Since we will not discuss
these further here, see Bishop (2006) for a more comprehensive discussion in the
context of machine learning.


Figure 11.13 A graphical representation of Bayes’ theorem showing (a) a joint distribution over two variables x and y expressed in factorized form, (b) the case with y set to an observed value, and (c) the resulting posterior distribution over x, given by Bayes’ theorem.


11.2 Conditional Independence


An important concept for probability distributions over multiple variables is that of 
conditional independence (Dawid, 1980). Consider three variables a, b, and c, and 
suppose that the conditional distribution of a given b and c is such that it does not 
depend on the value of b, so that 


p(a | b, c) = p(a | c).    (11.22)


We say that a is conditionally independent of b given c. This can be expressed in a 
slightly different way if we consider the joint distribution of a and b conditioned on 
c, which we can write in the form 


p(a, b | c) = p(a | b, c) p(b | c)
            = p(a | c) p(b | c)    (11.23)


where we have used the product rule of probability together with (11.22). We see 
that, conditioned on c, the joint distribution of a and b factorizes into the product of 
the marginal distribution of a and the marginal distribution of b (again both condi- 
tioned on c). This says that the variables a and b are statistically independent, given 
c. Note that our definition of conditional independence will require that (11.22), or 
equivalently (11.23), must hold for every possible value of c, and not just for some 
values. We will sometimes use a shorthand notation for conditional independence 
(Dawid, 1979) in which 


a \perp\!\!\!\perp b | c    (11.24)


denotes that a is conditionally independent of b given c. Conditional independence 
properties play an important role in probabilistic models for machine learning be- 
cause they simplify both the structure of a model and the computations needed to 
perform inference and learning under that model. 
Figure 11.14 The first of three examples of graphs over three variables a, b, and c used to discuss conditional independence properties of directed graphical models.


If we are given an expression for the joint distribution over a set of variables 
in terms of a product of conditional distributions (i.e., the mathematical representa- 
tion underlying a directed graph), then we could in principle test whether any po- 
tential conditional independence property holds by repeated application of the sum 
and product rules of probability. In practice, such an approach would be very time- 
consuming. An important and elegant feature of graphical models is that conditional 
independence properties of the joint distribution can be read directly from the graph 
without having to perform any analytical manipulations. The general framework 
for achieving this is called d-separation, where the ‘d’ stands for ‘directed’ (Pearl, 
1988). Here we will motivate the concept of d-separation and give a general state- 
ment of the d-separation criterion. A formal proof can be found in Lauritzen (1996). 


11.2.1 Three example graphs 


We begin our discussion of the conditional independence properties of directed 
graphs by considering three simple examples each involving graphs having just three 
nodes. Together, these will motivate and illustrate the key concepts of d-separation. 
The first of the three examples is shown in Figure 11.14, and the joint distribution 
corresponding to this graph is easily written down using the general result (11.6) to 
give 

p(a, b, c) = p(a|c) p(b|c) p(c). (11.25)
If none of the variables are observed, then we can investigate whether a and b are 
independent by marginalizing both sides of (11.25) with respect to c to give 


p(a, b) = \sum_{c} p(a|c) p(b|c) p(c). (11.26)


In general, this does not factorize into the product p(a)p(b), and so 
a ⊥̸⊥ b | ∅ (11.27)


where ∅ denotes the empty set, and the symbol ⊥̸⊥ means that the conditional
independence property does not hold in general. Of course, it may hold for a particular
distribution by virtue of the specific numerical values associated with the various 
conditional probabilities, but it does not follow in general from the structure of the 
graph. 

Now suppose we condition on the variable c, as represented by the graph of
Figure 11.15. From (11.25), we can easily write down the conditional distribution of
a and b, given c, in the form

p(a, b|c) = \frac{p(a, b, c)}{p(c)} = p(a|c) p(b|c)

and so we obtain the conditional independence property

a ⊥⊥ b | c.

Figure 11.15 As in Figure 11.14 but where we have conditioned on the value of
variable c.


We can provide a simple graphical interpretation of this result by considering 
the path from node a to node b via c. The node c is said to be tail-to-tail with respect 
to this path because the node is connected to the tails of the two arrows, and the 
presence of such a path connecting nodes a and b causes these nodes to be depen- 
dent. However, when we condition on node c, as in Figure 11.15, the conditioned 
node ‘blocks’ the path from a to b and causes a and b to become (conditionally) 
independent. 

We can similarly consider the graph shown in Figure 11.16. The joint distribu- 
tion corresponding to this graph is again obtained from our general formula (11.6) 
to give 

p(a, b, c) = p(a) p(c|a) p(b|c). (11.28)


First, suppose that none of the variables are observed. Again, we can test to see if a 
and b are independent by marginalizing over c to give 


p(a, b) = p(a) \sum_{c} p(c|a) p(b|c) = p(a) p(b|a)


which in general does not factorize into p(a)p(b), and so

a ⊥̸⊥ b | ∅ (11.29)


as before. 
Now suppose we condition on node c, as shown in Figure 11.17. Using Bayes’
theorem together with (11.28), we obtain

p(a, b|c) = \frac{p(a, b, c)}{p(c)}
          = \frac{p(a) p(c|a) p(b|c)}{p(c)}
          = p(a|c) p(b|c)

and so again we obtain the conditional independence property

a ⊥⊥ b | c.

Figure 11.16 The second of our three examples of three-node graphs used to motivate
the conditional independence framework for directed graphical models.

Figure 11.17 As in Figure 11.16 but now conditioning on node c.


As before, we can interpret these results graphically. The node c is said to be 
head-to-tail with respect to the path from node a to node b. Such a path connects 
nodes a and b and renders them dependent. If we now observe c, as in Figure 11.17, 
then this observation ‘blocks’ the path from a to b and so we obtain the conditional 
independence property a ⊥⊥ b | c.

Finally, we consider the third of our three-node examples, shown by the graph in 
Figure 11.18. As we will see, this has a more subtle behaviour than the two previous 
graphs. The joint distribution can again be written down using our general result 
(11.6) to give 

p(a, b, c) = p(a) p(b) p(c|a, b). (11.30)
Consider first the case where none of the variables are observed. Marginalizing both 
sides of (11.30) over c we obtain 


p(a, b) = p(a)p(b) 


and so a and b are independent with no variables observed, in contrast to the two 
previous examples. We can write this result as 


a ⊥⊥ b | ∅. (11.31)

Now suppose we condition on c, as indicated in Figure 11.19. The conditional
distribution of a and b is then given by

p(a, b|c) = \frac{p(a) p(b) p(c|a, b)}{p(c)}

which in general does not factorize into the product p(a|c)p(b|c), and so

a ⊥̸⊥ b | c.

Figure 11.18 The last of our three examples of three-node graphs used to explore
conditional independence properties in graphical models. This graph has rather
different properties from the two previous examples.

Figure 11.19 As in Figure 11.18 but conditioning on the value of node c. In
this graph, the act of conditioning induces a dependence between a and b.

Exercise 11.13


Thus, our third example has the opposite behaviour from the first two. Graphically, 
we say that node c is head-to-head with respect to the path from a to b because it 
connects to the heads of the two arrows. The node c is sometimes called a collider 
node. When node c is unobserved, it ‘blocks’ the path, and the variables a and b are 
independent. However, conditioning on c ‘unblocks’ the path and renders a and b 
dependent. 

There is one more subtlety associated with this third example that we need to 
consider. First we introduce some more terminology. We say that node y is a de- 
scendant of node x if there is a path from x to y in which each step of the path 
follows the directions of the arrows. Then it can be shown that a head-to-head path 
will become unblocked if either the node, or any of its descendants, is observed. 

In summary, a tail-to-tail node or a head-to-tail node leaves a path unblocked 
unless it is observed, in which case it blocks the path. By contrast, a head-to-head 
node blocks a path if it is unobserved, but once the node and/or at least one of its 
descendants is observed the path becomes unblocked. 
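
The behaviour of the three example graphs can be checked numerically. The following is a minimal sketch, assuming binary variables and arbitrarily chosen probability tables (the numbers and the helper names `bern`, `factorizes`, and `independent` are illustrative and do not come from the text). It builds each joint distribution from its factorization, (11.25), (11.28), or (11.30), and tests whether p(a, b) or p(a, b|c) factorizes.

```python
from itertools import product

VALS = (0, 1)

def bern(p_one, x):
    """Probability of binary value x when p(x = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one

def factorizes(table, tol=1e-12):
    """True if the table over (a, b), once normalized, equals p(a) p(b)."""
    total = sum(table.values())
    pa = {a: sum(table[a, b] for b in VALS) / total for a in VALS}
    pb = {b: sum(table[a, b] for a in VALS) / total for b in VALS}
    return all(abs(table[a, b] / total - pa[a] * pb[b]) < tol for a, b in table)

def independent(joint, given_c):
    """Test a indep b | c for every value of c, or a indep b if given_c is False."""
    pairs = list(product(VALS, VALS))
    if given_c:
        return all(factorizes({(a, b): joint[a, b, c] for a, b in pairs})
                   for c in VALS)
    return factorizes({(a, b): sum(joint[a, b, c] for c in VALS) for a, b in pairs})

# Made-up conditional probability tables, one joint per graph.
tail_to_tail = {(a, b, c): bern(0.6, c) * bern([0.2, 0.7][c], a) * bern([0.3, 0.9][c], b)
                for a, b, c in product(VALS, repeat=3)}          # p(a|c) p(b|c) p(c)
head_to_tail = {(a, b, c): bern(0.3, a) * bern([0.2, 0.8][a], c) * bern([0.1, 0.6][c], b)
                for a, b, c in product(VALS, repeat=3)}          # p(a) p(c|a) p(b|c)
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9}       # p(c = 1 | a, b)
head_to_head = {(a, b, c): bern(0.3, a) * bern(0.6, b) * bern(p_c[a, b], c)
                for a, b, c in product(VALS, repeat=3)}          # p(a) p(b) p(c|a, b)

for name, joint in [("tail-to-tail", tail_to_tail), ("head-to-tail", head_to_tail),
                    ("head-to-head", head_to_head)]:
    print(name, independent(joint, given_c=False), independent(joint, given_c=True))
```

For these tables, the first two graphs report dependence with nothing observed and independence given c, whereas the head-to-head graph reports the reverse, in line with the summary above.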


11.2.2 Explaining away 


It is worth spending a moment to understand further the unusual behaviour of the 
graph in Figure 11.19. Consider a particular instance of such a graph corresponding 
to a problem with three binary random variables relating to the fuel system on a car, 
as shown in Figure 11.20. The variables are B, which represents the state of a battery 
that is either charged (B = 1) or flat (B = 0), F which represents the state of the 
fuel tank that is either full of fuel (F = 1) or empty (F = 0), and G, which is the 
state of an electric fuel gauge and which indicates that the fuel tank is either full 
(G = 1) or empty (G = 0). The battery is either charged or flat, and independently,
the fuel tank is either full or empty, with prior probabilities

p(B = 1) = 0.9
p(F = 1) = 0.9.

Figure 11.20 An example of a three-node graph used to illustrate ‘explaining away’.
The three nodes represent the state of the battery (B), the state of the fuel tank (F),
and the reading on the electric fuel gauge (G). See the text for details.


Given the state of the fuel tank and the battery, the fuel gauge reads full with proba- 
bilities given by 


p(G = 1|B = 1, F = 1) = 0.8
p(G = 1|B = 1, F = 0) = 0.2
p(G = 1|B = 0, F = 1) = 0.2
p(G = 1|B = 0, F = 0) = 0.1


so this is a rather unreliable fuel gauge! All remaining probabilities are determined 
by the requirement that probabilities sum to one, and so we have a complete specifi- 
cation of the probabilistic model. 

Before we observe any data, the prior probability of the fuel tank being empty 
is p(F = 0) = 0.1. Now suppose that we observe the fuel gauge and discover that 
it reads empty, i.e., G = 0, corresponding to the middle graph in Figure 11.20. We 
can use Bayes’ theorem to evaluate the posterior probability of the fuel tank being 
empty. First we evaluate the denominator for Bayes’ theorem: 


p(G = 0) = \sum_{B \in \{0,1\}} \sum_{F \in \{0,1\}} p(G = 0|B, F) p(B) p(F) = 0.315 (11.32)

and similarly we evaluate

p(G = 0|F = 0) = \sum_{B \in \{0,1\}} p(G = 0|B, F = 0) p(B) = 0.81 (11.33)

and using these results, we have

p(F = 0|G = 0) = \frac{p(G = 0|F = 0) p(F = 0)}{p(G = 0)} \simeq 0.257 (11.34)

and so p(F = 0|G = 0) > p(F = 0). Thus, observing that the gauge reads empty 
makes it more likely that the tank is indeed empty, as we would intuitively expect. 
Next suppose that we also check the state of the battery and find that it is flat, i.e., 
B = 0. We have now observed the states of both the fuel gauge and the battery, as 
shown by the right-hand graph in Figure 11.20. The posterior probability that the 
fuel tank is empty given the observations of both the fuel gauge and the battery state 
is then given by 


p(F = 0|G = 0, B = 0) = \frac{p(G = 0|B = 0, F = 0) p(F = 0)}{\sum_{F \in \{0,1\}} p(G = 0|B = 0, F) p(F)} \simeq 0.111 (11.35)

where the prior probability p(B = 0) has cancelled between the numerator and
denominator. Thus, the probability that the tank is empty has decreased (from 0.257
to 0.111) as a result of the observation of the state of the battery. This accords with
our intuition that finding that the battery is flat explains away the observation that the
fuel gauge reads empty. We see that the state of the fuel tank and that of the battery
have indeed become dependent on each other as a result of observing the reading on
the fuel gauge. In fact, this would also be the case if, instead of observing the fuel
gauge directly, we observed the state of some descendant of G, for example a rather
unreliable witness who reports seeing that the gauge was reading empty. Note that
the probability p(F = 0|G = 0, B = 0) ≃ 0.111 is greater than the prior probability
p(F = 0) = 0.1 because the observation that the fuel gauge reads zero still provides
some evidence in favour of an empty fuel tank.

Figure 11.21 Illustration of d-separation, with panels (a) and (b). See the text for
details.

Exercise 11.14
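
The numbers in the fuel-gauge example of Section 11.2.2 can be reproduced directly from the stated probability tables. The sketch below uses only the priors and conditional probabilities given in the text; the variable and function names are illustrative choices.

```python
# Priors and gauge model from the text; keys are the binary states (1 or 0).
p_B = {1: 0.9, 0: 0.1}                      # battery charged / flat
p_F = {1: 0.9, 0: 0.1}                      # tank full / empty
p_G1 = {(1, 1): 0.8, (1, 0): 0.2,           # p(G = 1 | B, F), keyed by (B, F)
        (0, 1): 0.2, (0, 0): 0.1}

def p_G(g, b, f):
    """p(G = g | B = b, F = f); the G = 0 case follows from normalization."""
    return p_G1[b, f] if g == 1 else 1.0 - p_G1[b, f]

# Equation (11.32): marginal probability that the gauge reads empty.
p_g0 = sum(p_G(0, b, f) * p_B[b] * p_F[f] for b in (0, 1) for f in (0, 1))

# Equation (11.33): gauge reads empty given that the tank is empty.
p_g0_given_f0 = sum(p_G(0, b, 0) * p_B[b] for b in (0, 1))

# Equation (11.34): posterior that the tank is empty given G = 0.
post_f0_g0 = p_g0_given_f0 * p_F[0] / p_g0

# Equation (11.35): posterior after also observing a flat battery, B = 0.
post_f0_g0_b0 = (p_G(0, 0, 0) * p_F[0]
                 / sum(p_G(0, 0, f) * p_F[f] for f in (0, 1)))

print(round(p_g0, 3), round(p_g0_given_f0, 2),
      round(post_f0_g0, 3), round(post_f0_g0_b0, 3))
```

The four printed values correspond to 0.315, 0.81, 0.257, and 0.111, so the posterior probability of an empty tank falls once the flat battery is observed but stays above the prior of 0.1.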


11.2.3 D-separation 


We now give a general statement of the d-separation property (Pearl, 1988) for 
directed graphs. Consider a general directed graph in which A, B, and C are arbi- 
trary non-intersecting sets of nodes (whose union may be smaller than the complete 
set of nodes in the graph). We wish to ascertain whether a particular conditional 
independence statement A ⊥⊥ B | C is implied by a given directed acyclic graph. To
do so, we consider all possible paths from any node in A to any node in B. Any such 
path is said to be blocked if it includes a node such that either 


(a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the 
node is in the set C, or 


(b) the arrows meet head-to-head at the node and neither the node, nor any of its 
descendants is in the set C. 


If all paths are blocked, then A is said to be d-separated from B by C, and the joint 
distribution over all the variables in the graph will satisfy A ⊥⊥ B | C.
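
The criterion can be turned into a short program. The following is a minimal sketch that enumerates every undirected simple path between A and B and applies rules (a) and (b) to each interior node; this brute-force approach is only practical for small graphs, and the function names are illustrative. The edge list in the example is an assumption, chosen to match the description of Figure 11.21 given in the text (a → e ← f → b, with e → c).

```python
def descendants(children, node):
    """All nodes reachable from `node` by following arrows."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def d_separated(edges, A, B, C):
    """True if every path from a node in A to a node in B is blocked by C."""
    parents, children, neighbours = {}, {}, {}
    for u, v in edges:                       # each edge is an arrow u -> v
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    C = set(C)

    def blocked(path):
        for prev, node, nxt in zip(path, path[1:], path[2:]):
            node_parents = parents.get(node, ())
            if prev in node_parents and nxt in node_parents:
                # Rule (b): head-to-head, blocked unless node or a descendant is in C.
                if node not in C and not (descendants(children, node) & C):
                    return True
            elif node in C:
                # Rule (a): head-to-tail or tail-to-tail node that is in C.
                return True
        return False

    def simple_paths(current, goal, path):
        if current == goal:
            yield path
            return
        for nxt in neighbours.get(current, ()):
            if nxt not in path:
                yield from simple_paths(nxt, goal, path + [nxt])

    return all(blocked(p) for a in A for b in B for p in simple_paths(a, b, [a]))

# Assumed edges for the graph of Figure 11.21.
edges = [("a", "e"), ("f", "e"), ("f", "b"), ("e", "c")]
print(d_separated(edges, {"a"}, {"b"}, {"c"}))   # conditioning on c
print(d_separated(edges, {"a"}, {"b"}, {"f"}))   # conditioning on f
```

With these edges, the first call reports that a and b are not d-separated by c, and the second that they are d-separated by f.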

D-separation is illustrated in Figure 11.21. In graph (a), the path from a to b is
not blocked by node f because it is a tail-to-tail node for this path and is not observed,
nor is it blocked by node e because, although the latter is a head-to-head node, it has
a descendant c in the conditioning set. Thus, the conditional independence statement
a ⊥⊥ b | c does not follow from this graph. In graph (b), the path from a to b is blocked
by node f because this is a tail-to-tail node that is observed, and so the conditional
independence property a ⊥⊥ b | f will be satisfied by any distribution that factorizes
according to this graph. Note that this path is also blocked by node e because e is
a head-to-head node and neither it nor its descendant is in the conditioning set. In
d-separation, parameters such as A in Figure 11.12, which are indicated by floating
variables, behave in the same way as observed nodes. However, there are no marginal
distributions associated with such nodes, and consequently parameter nodes never
themselves have parents and so all paths through these nodes will always be tail-to-tail
and hence blocked. Consequently they play no role in d-separation.

Figure 11.22 A graphical representation of the naive Bayes model for classification.
Conditioned on the class label C, the elements of the observed vector
x = (x^(1), ..., x^(L)) are assumed to be independent.

Section 2.3.2

Another example of conditional independence and d-separation is provided by 
i.i.d. (independent and identically distributed) data. Consider the binary classification
model shown in Figure 11.12. Here the stochastic nodes correspond to {t_n},
w, and t̂. We see that the node for w is tail-to-tail with respect to the path from t̂
to any one of the nodes t_n, and so we have the following conditional independence
property:

t̂ ⊥⊥ t_n | w. (11.36)


Thus, conditioned on the network parameters w, the predictive distribution for t̂ is
independent of the training data {t_1, ..., t_N}. We can therefore first use the training
data to determine the posterior distribution (or some approximation to the posterior
distribution) over the coefficients w and then we can discard the training data and use
the posterior distribution for w to make predictions of t̂ for new input observations
x̂.


11.2.4 Naive Bayes 


A related graphical structure arises in an approach to classification called the
naive Bayes model, in which we use conditional independence assumptions to simplify
the model structure. Suppose our data consists of observations of a vector x,
and we wish to assign values of x to one of K classes. We can define a class-conditional
density p(x|C_k) for each of the classes, along with prior class probabilities
p(C_k). The key assumption of the naive Bayes model is that, conditioned on the
class C_k, the distribution of the input variable factorizes into the product of two or
more densities. Suppose we partition x into L elements x = (x^(1), ..., x^(L)). Naive
Bayes then takes the form

p(x|C_k) = \prod_{l=1}^{L} p(x^{(l)}|C_k) (11.37)


where it is assumed that (11.37) holds for each of the classes C_k separately. The
graphical representation of this model is shown in Figure 11.22. We see that an
observation of C_k would block the path between x^(i) and x^(j) for j ≠ i because such
paths are tail-to-tail at the node C_k, and so x^(i) and x^(j) are conditionally independent
given C_k. If, however, we marginalize out C_k, the tail-to-tail path from x^(i) to x^(j) is
no longer blocked, which tells us that in general the marginal density p(x) will not
factorize with respect to the elements x^(1), ..., x^(L).

Figure 11.23 Illustration of a naive Bayes classifier for a two-dimensional data space,
showing (a) the conditional distributions p(x|C_k) for each of the two classes and
(b) the marginal distribution p(x) in which we have assumed equal class priors
p(C_1) = p(C_2) = 0.5. Note that the conditional distributions factorize with respect to
x_1 and x_2, whereas the marginal distribution does not.

Exercise 11.15

If we are given a labelled training set, comprising observations {x_1, ..., x_N}
together with their class labels, then we can fit the naive Bayes model to the training
data using maximum likelihood by assuming that the data are drawn independently
from the model. The solution is obtained by fitting the model for each class separately
using the corresponding labelled data and then setting the class priors p(C_k)
equal to the fraction of training data points in each class. The probability that a
vector x belongs to class C_k is then given by Bayes’ theorem in the form

p(C_k|x) = \frac{p(x|C_k) p(C_k)}{p(x)} (11.38)

where p(x|C_k) is given by (11.37), and p(x) can be evaluated using

p(x) = \sum_{k=1}^{K} p(x|C_k) p(C_k). (11.39)
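
The fitting and prediction steps can be sketched in a few lines. The example below is a minimal illustration, assuming that every element of x is a real-valued scalar modelled by a univariate Gaussian for each class (one possible choice of component density, not prescribed by the text). The function names, the toy data, and the small variance floor of 1e-9 are all illustrative assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Maximum likelihood fit: per-class prior, feature means, feature variances."""
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # The 1e-9 floor (an assumption) avoids zero variance on tiny data sets.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)   # prior = class fraction
    return model

def posterior(model, x):
    """p(C_k | x) from (11.38), with p(x | C_k) factorized as in (11.37)."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        log_lik = sum(-0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
                      for v, m, s2 in zip(x, means, variances))
        log_joint[label] = math.log(prior) + log_lik
    # Normalize in log space; the denominator plays the role of p(x) in (11.39).
    top = max(log_joint.values())
    unnorm = {label: math.exp(v - top) for label, v in log_joint.items()}
    z = sum(unnorm.values())
    return {label: v / z for label, v in unnorm.items()}

# Toy two-class data set (made up for illustration).
X = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
post = posterior(model, [0.1, 0.0])
print(post)
```

Working with log probabilities is a design choice that avoids numerical underflow when L is large, since the product in (11.37) becomes a sum.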


The naive Bayes model is illustrated for a two-dimensional data space in Figure 11.23
in which x = (x_1, x_2). Here we assume that the conditional densities
p(x|C_k) for each of the two classes are axis-aligned Gaussians, and hence that they
each factorize with respect to x_1 and x_2 so that

p(x|C_k) = p(x_1|C_k) p(x_2|C_k). (11.40)

However, the marginal density p(x) given by

p(x) = \sum_{k=1}^{K} p(x|C_k) p(C_k) (11.41)

is now a mixture of Gaussians and does not factorize with respect to x_1 and x_2.
We have already encountered a simple application of the naive Bayes model in the 
context of fusing data from different sources, such as blood tests and skin images for 
medical diagnosis. 

The naive Bayes assumption is helpful when the dimensionality D of the input 
space is high, making density estimation in the full D-dimensional space more chal- 
lenging. It is also useful if the input vector contains both discrete and continuous 
variables, since each can be represented separately using appropriate models (e.g., 
Bernoulli distributions for binary observations or Gaussians for real-valued vari- 
ables). The conditional independence assumption of this model is clearly a strong 
one that may lead to rather poor representations of the class-conditional densities. 
Nevertheless, even if this assumption is not precisely satisfied, the model may still 
give good classification performance in practice because the decision boundaries can 
be insensitive to some of the details in the class-conditional densities, as illustrated 
in Figure 5.8. 


11.2.5 Generative models 


Many applications of machine learning can be viewed as examples of inverse 
problems in which there is an underlying, often physical, process that generates data, 
and the goal is to learn how to invert this process. For example, an image of an
object can be viewed as the output of a generative process in which the type of 
object is selected from some distribution of possible object classes. The position and 
orientation of the object are also chosen from some prior distributions, and then the 
resulting image is created. Given a large data set of images labelled with the type, 
position, and scale of the objects they contain, the goal is to train a machine learning 
model that can take new, unlabelled images and detect the presence of an object 
including its location within the image and its size. The machine learning solution 
therefore represents the inverse of the process that generated the data. 

One approach would be to train a deep neural network, such as a convolutional 
network, to take an image as input and to generate outputs that describe the object’s 
type, position, and scale. This approach therefore tries to solve the inverse problem 
directly and is an example of a discriminative model. It can achieve high accuracy 
provided ample examples of labelled images are available. In practice, unlabelled 
images are often plentiful, and much of the effort in obtaining a training set goes 
into providing the labels, which may be done by hand. Our simple discriminative
model cannot directly make use of unlabelled images during training. 


Figure 11.24 A graphical model representing the process by which images of objects
are created. The identity of an object (a discrete variable) and the position and
orientation of that object (continuous variables) have independent prior probabilities.
The image (an array of pixel intensities) has a probability distribution that is
dependent on the identity of the object as well as on its position and orientation.
The nodes in the figure are labelled class, position, scale, and image.


An alternative approach is to model the generative process and then subse- 
quently to invert it computationally. In our image example, if we assume that the 
object’s class, position, and scale are all chosen independently, then we can represent 
the generative process using a directed graphical model as shown in Figure 11.24. 
Note that the directions of the arrows correspond to the sequence of generative steps, 
and so the model represents the causal process (Pearl, 1988) by which the observed 
data is generated. This is an example of a generative model because once it is trained, 
it can be used to generate synthetic images by first selecting values for the object’s class,
position, and scale from the learned prior distributions and then subsequently sam- 
pling an image from the learned conditional distribution. We will later see how diffu- 
sion models and other generative models can synthesize impressive high-resolution 
images based on a textual description of the desired content and style of the image. 
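
The sampling procedure just described, in which parent variables are drawn first and the image is then drawn conditioned on them, can be sketched as follows. Everything concrete here is a hypothetical stand-in: the class labels, the prior distributions, and the toy `render` function that plays the role of the learned conditional distribution over images are invented for illustration and are far simpler than any realistic image model.

```python
import random

def render(obj_class, position, scale, rng, width=8):
    """Toy stand-in for p(image | class, position, scale): a noisy 1-D 'image'."""
    brightness = 1.0 if obj_class == "cat" else 2.0     # hypothetical class effect
    centre = position * (width - 1)
    return [brightness * scale * max(0.0, 1.0 - abs(i - centre)) + rng.gauss(0.0, 0.05)
            for i in range(width)]

def sample_scene(rng):
    """Sample the parents from their priors, then the image given the parents."""
    obj_class = rng.choices(["cat", "dog"], weights=[0.3, 0.7])[0]  # hypothetical prior
    position = rng.uniform(0.0, 1.0)                                # hypothetical prior
    scale = rng.uniform(0.5, 2.0)                                   # hypothetical prior
    image = render(obj_class, position, scale, rng)
    return obj_class, position, scale, image

rng = random.Random(0)           # fixed seed so that the sketch is repeatable
obj_class, position, scale, image = sample_scene(rng)
print(obj_class, round(position, 3), round(scale, 3), len(image))
```

The order of the sampling steps follows the direction of the arrows in the graph, which is why no variable is ever sampled before its parents.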

The graph in Figure 11.24 assumes that, when no image is observed, the class, 
position, and scale variables are independent. This follows because every path be- 
tween any two of these variables is head-to-head with respect to the image variable, 
which is unobserved. However, when we observe an image, those paths become 
unblocked, and the class, position, and scale variables are no longer independent. 
Intuitively this is reasonable because being told the identity of the object within the 
image provides us with very relevant information to assist us with determining its 
location. 

The hidden variables in a probabilistic model need not, however, have any ex- 
plicit physical interpretation but may be introduced simply to allow a more complex 
joint distribution to be constructed from simpler components. For example, models 
such as normalizing flows, variational autoencoders, and diffusion models all use 
deep neural networks to create complex distributions in the data space by transform- 
ing hidden variables having a simple Gaussian distribution. 


11.2.6 Markov blanket 


A conditional independence property that is helpful when discussing more complex
directed graphs is called the Markov blanket or Markov boundary. Consider
a joint distribution p(x_1, ..., x_D) represented by a directed graph having D nodes,
and consider the conditional distribution of a particular node with variables x_i
conditioned on all the remaining variables x_{j≠i}. Using the factorization property (11.6),
we can express this conditional distribution in the form

p(x_i | x_{\{j \neq i\}}) = \frac{p(x_1, \ldots, x_D)}{\int p(x_1, \ldots, x_D) \, dx_i}
                        = \frac{\prod_{k} p(x_k | pa(k))}{\int \prod_{k} p(x_k | pa(k)) \, dx_i}

in which the integral is replaced by a summation for discrete variables. We now
observe that any factor p(x_k|pa(k)) that does not have any functional dependence
on x_i can be taken outside the integral over x_i and will therefore cancel between
numerator and denominator. The only factors that remain will be the conditional
distribution p(x_i|pa(i)) for node x_i itself, together with the conditional distributions
for any nodes x_k such that node x_i is in the conditioning set of p(x_k|pa(k)), in other
words for which x_i is a parent of x_k. The conditional p(x_i|pa(i)) will depend on
the parents of node x_i, whereas the conditionals p(x_k|pa(k)) will depend on the
children of x_i as well as on the co-parents, in other words variables corresponding
to parents of node x_k other than node x_i. The set of nodes comprising the parents,
the children, and the co-parents is called the Markov blanket and is illustrated in
Figure 11.25.

Figure 11.25 The Markov blanket of a node x_i comprises the set of parents, children,
and co-parents of the node. It has the property that the conditional distribution of x_i,
conditioned on all the remaining variables in the graph, is dependent only on the
variables in the Markov blanket.

We can think of the Markov blanket of a node x_i as being the minimal set of
nodes that isolates x_i from the rest of the graph. Note that it is not sufficient to
include only the parents and children of node x_i because explaining away means
that observations of the child nodes will not block paths to the co-parents. We must
therefore observe the co-parent nodes as well.
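
Reading off the Markov blanket from a graph is a purely structural operation. The sketch below assumes the graph is given as a list of (parent, child) pairs; the function name and the example graph are illustrative and not taken from the text.

```python
def markov_blanket(edges, node):
    """Parents, children, and co-parents of `node` in a directed graph."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    # Co-parents: other parents of the node's children.
    co_parents = {u for u, v in edges if v in children and u != node}
    return parents | children | co_parents

# Hypothetical graph: x has parents p1, p2 and children c1, c2;
# q1 and q2 are co-parents; g is a grandparent and d a grandchild.
edges = [("p1", "x"), ("p2", "x"), ("x", "c1"), ("x", "c2"),
         ("q1", "c1"), ("q2", "c2"), ("g", "p1"), ("c1", "d")]
print(sorted(markov_blanket(edges, "x")))
```

In this example the grandparent g and the grandchild d are excluded from the blanket, whereas the co-parents q1 and q2 are included.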


11.2.7 Graphs as filters 


We have seen that a particular directed graph represents a specific decomposi- 
tion of a joint probability distribution into a product of conditional probabilities, and 
it also expresses a set of conditional independence statements obtained through the 
d-separation criterion. The d-separation theorem is really an expression of the equiv- 
alence of these two properties. To make this clear, it is helpful to think of a directed 
graph as a filter. Suppose we consider a particular joint probability distribution p(x) 
over the variables x corresponding to the (unobserved) nodes of the graph. The filter
will allow this distribution to pass through if, and only if, it can be expressed in
terms of the factorization (11.6) implied by the graph. If we present to the filter the
set of all possible distributions p(x) over the set of variables x, then the subset of
distributions that are passed by the filter is denoted DF, for directed factorization.
This is illustrated in Figure 11.26.

Figure 11.26 We can view a graphical model (in this case a directed graph) as a filter
in which a probability distribution p(x) is allowed through the filter if, and only if, it
satisfies the directed factorization property (11.6). The set of all possible probability
distributions p(x) that pass through the filter is denoted DF. We can alternatively use
the graph to filter distributions according to whether they respect all the conditional
independence properties implied by the d-separation properties of the graph. The
d-separation theorem says the same set of distributions DF will be allowed through
this second kind of filter.

Alternatively, we can use the graph as a different kind of filter by first listing 
all the conditional independence properties obtained by applying the d-separation 
criterion to the graph and then allowing a distribution to pass only if it satisfies all of 
these properties. If we present all possible distributions p(x) to this second kind of 
filter, then the d-separation theorem tells us that the set of distributions that will be 
allowed through is precisely the set DF. 

It should be emphasized that the conditional independence properties obtained 
from d-separation apply to any probabilistic model described by that particular di- 
rected graph. This will be true, for instance, whether the variables are discrete or 
continuous or a combination of these. Again, we see that a particular graph describes 
a whole family of probability distributions. 

At one extreme, we have a fully connected graph that exhibits no conditional 
independence properties at all and which can represent any possible joint probability 
distribution over the given variables. The set DF will contain all possible distri- 
butions p(x). At the other extreme, we have a fully disconnected graph, i.e., one 
having no links at all. This corresponds to joint distributions that factorize into the 
product of the marginal distributions over the variables comprising the nodes of the 
graph. Note that for any given graph, the set of distributions DF will include any 
distributions that have additional independence properties beyond those described by 
the graph. For instance, a fully factorized distribution will always be passed through 
the filter implied by any graph over the corresponding set of variables. 


Sequence Models 


There are many important applications of machine learning in which the data consists
of a sequence of values. For example, text comprises a sequence of words, whereas
a protein comprises a sequence of amino acids. Many sequences are ordered by
time, such as the audio signals from a microphone or daily rainfall measurements
at a particular location. Sometimes the terminology of ‘time’ as well as ‘past’ and
‘future’ is used when referring to other types of sequential data, not just temporal
sequences. Applications involving sequences include speech recognition, automatic
translation between languages, detecting genes in DNA, synthesizing music, writing
computer code, holding a conversation with a modern search engine, and many
others.

Figure 11.27 An illustration of a general autoregressive model of the form (11.42)
with four nodes.

We will denote a data sequence by x_1, ..., x_N where each element x_n of the
sequence comprises a vector of values. Note that we might have several such se- 
quences drawn independently from the same distribution, in which case the joint 
distribution over all the sequences factorizes into the product of the distributions 
over each sequence individually. From now on, we focus on modelling just one of 
those sequences. 

We have already seen in (11.4) that by repeated application of the product rule 
of probability, a general distribution over N variables can be written as the product 
of conditional distributions, and that the form of this decomposition depends on a 
specific ordering for the variables. For vector-valued variables, if we choose
an ordering that corresponds to the order of the variables in the sequence, then we
can write


p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n | x_1, \ldots, x_{n-1}). (11.42)


This corresponds to a directed graph in which each node receives a link from every 
previous node in the sequence, as illustrated using four variables in Figure 11.27. 
This is known as an autoregressive model. 

This representation has complete generality and therefore from a modelling per- 
spective adds no value since it encodes no assumptions. We can constrain the space 
of models by introducing conditional independence properties by removing links 
from the graph, or equivalently by removing variables from the conditioning set of 
the factors on the right-hand-side of (11.42). 

The strongest assumption would be to remove all conditioning variables, giving 
a joint distribution of the form 


p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n), (11.43)


which treats the variables as independent and therefore completely ignores the order- 
ing information. This corresponds to a probabilistic graphical model without links, 
as shown in Figure 11.28. 


Figure 11.28 The simplest approach to modelling a sequence of observations is to
treat them as independent, corresponding to a probabilistic graphical model without
links.


Interesting models that capture sequential properties while introducing mod- 
elling assumptions lie between these two extremes. One strong assumption would 
be to assume that each conditional distribution depends only on the immediately 
preceding variable in the sequence, giving a joint distribution of the form 


p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_{n-1}). (11.44)


Note that the first variable in the sequence is treated slightly differently since it has 
no conditioning variable. The functional form (11.44) is known as a Markov model, 
or Markov chain, and is represented by a graph consisting of a simple chain of nodes, 
as seen in Figure 11.29. Using d-separation, we see that the conditional distribution 
for observation $\mathbf{x}_n$, given all of the observations up to time $n$, is given by


$$p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1}) = p(\mathbf{x}_n \mid \mathbf{x}_{n-1}), \tag{11.45}$$


which is easily verified by direct evaluation starting from (11.44) and using the prod- 
uct rule of probability. Thus, if we use such a model to predict the next observation 
in a sequence, the distribution of predictions will depend only on the value of the im- 
mediately preceding observation and will be independent of all earlier observations. 
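
The conditional independence property (11.45) can also be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state chain whose probabilities are invented for illustration. It evaluates the joint distribution (11.44) and recovers the conditional distribution given the full history from the product rule. The names `initial`, `transition`, `joint`, and `conditional_on_history` are illustrative and do not come from the text.

```python
import itertools

# Hypothetical two-state chain; the numbers are illustrative only.
initial = [0.6, 0.4]                 # p(x_1)
transition = [[0.7, 0.3],            # transition[i][j] = p(x_n = j | x_{n-1} = i)
              [0.2, 0.8]]


def joint(seq):
    """Joint probability of a state sequence under the factorization (11.44)."""
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


def conditional_on_history(history, value):
    """p(x_n = value | x_1, ..., x_{n-1} = history) from the product rule:
    the joint of the extended sequence divided by the marginal of the history."""
    numerator = joint(list(history) + [value])
    denominator = sum(joint(list(history) + [v]) for v in range(len(initial)))
    return numerator / denominator


# Compare with p(x_n | x_{n-1}) for every history of length three.
max_error = max(
    abs(conditional_on_history(history, value) - transition[history[-1]][value])
    for history in itertools.product(range(2), repeat=3)
    for value in range(2)
)
print(max_error)
```

Under these assumptions the printed discrepancy should be zero up to floating-point rounding, whichever history is supplied.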

More specifically, (11.44) is known as a first-order Markov model because only 
one conditioning variable appears in each conditional distribution. We can extend 
the model by allowing each conditional distribution to depend on the two preceding 
variables, giving a second-order Markov model of the form 


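$$p(\mathbf{x}_1, \ldots, \mathbf{x}_N) = p(\mathbf{x}_1)\, p(\mathbf{x}_2 \mid \mathbf{x}_1) \prod_{n=3}^{N} p(\mathbf{x}_n \mid \mathbf{x}_{n-1}, \mathbf{x}_{n-2}). \tag{11.46}$$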


Note that the first two variables are treated differently as they have fewer than two 
conditioning variables. This model is shown as a directed graph in Figure 11.30. 

Figure 11.29 A first-order Markov chain of observations in which the distribution
of a particular observation $\mathbf{x}_n$ is conditioned on the value of the
previous observation $\mathbf{x}_{n-1}$.

Figure 11.30 A second-order Markov chain in which the conditional distribution
of a particular observation $\mathbf{x}_n$ depends on the values of the two
previous observations $\mathbf{x}_{n-1}$ and $\mathbf{x}_{n-2}$.

By using d-separation (or by direct evaluation using the rules of probability),
we see that in the second-order Markov model, the conditional distribution of $\mathbf{x}_n$
given all previous observations $\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}$ is independent of the observations
$\mathbf{x}_1, \ldots, \mathbf{x}_{n-3}$. We can similarly consider extensions to an $M$th-order Markov chain in
which the conditional distribution for a particular variable depends on the previous
$M$ variables. However, we have paid a price for this increased flexibility because the
number of parameters in the model is now much larger. Suppose the observations
are discrete variables having $K$ states. Then the conditional distribution $p(\mathbf{x}_n \mid \mathbf{x}_{n-1})$
in a first-order Markov chain will be specified by a set of $K - 1$ parameters for
each of the $K$ states of $\mathbf{x}_{n-1}$, giving a total of $K(K - 1)$ parameters. Now suppose
we extend the model to an $M$th-order Markov chain, so that the joint distribution is
built up from conditionals $p(\mathbf{x}_n \mid \mathbf{x}_{n-M}, \ldots, \mathbf{x}_{n-1})$. If the variables are discrete and
if the conditional distributions are represented by general conditional probability
tables, then such a model will have $K^{M}(K - 1)$ parameters, which reduces to the
first-order count $K(K - 1)$ when $M = 1$. Thus, the number of
parameters grows exponentially with $M$, which will generally render this approach
impractical for larger values of $M$.
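
The exponential growth in the number of parameters is easy to tabulate. The sketch below counts the free parameters of a single general conditional probability table, on the reasoning that each joint setting of the M conditioning variables needs K - 1 free probabilities because the last one follows from normalization. The function name `markov_table_parameters` is hypothetical, and the tables for the first few variables of the chain, which have fewer conditioning variables, are ignored.

```python
def markov_table_parameters(num_states, order):
    """Free parameters in one conditional probability table of an order-M chain.

    There are num_states ** order joint settings of the conditioning variables,
    and each setting needs num_states - 1 free probabilities.
    """
    return num_states ** order * (num_states - 1)


# Tabulate the growth for K = 10 states and orders M = 1, ..., 5.
for order in range(1, 6):
    print(order, markov_table_parameters(10, order))
```

With K = 10 the count rises from 90 for a first-order chain to 900,000 for a fifth-order chain, a tenfold increase for each additional conditioning variable.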


11.3.1 Hidden variables 


Suppose we wish to build a model for sequences that is not limited by the 
Markov assumption to any order and yet can be specified using a limited number of 
free parameters. We can achieve this by introducing additional latent variables, thus 
permitting a rich class of models to be constructed out of simple components. For 
each observation $\mathbf{x}_n$, we introduce a corresponding latent variable $\mathbf{z}_n$ (which may be
of different type or dimensionality to the observed variable). We now assume that it
is the latent variables that form a Markov chain, giving rise to the graphical structure
known as a state-space model, which is shown in Figure 11.31. It satisfies the key
conditional independence property that $\mathbf{z}_{n-1}$ and $\mathbf{z}_{n+1}$ are independent given $\mathbf{z}_n$, so
that

$$\mathbf{z}_{n+1} \perp\!\!\!\perp \mathbf{z}_{n-1} \mid \mathbf{z}_n. \tag{11.47}$$


The joint distribution for this model is given by 


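$$p(\mathbf{x}_1, \ldots, \mathbf{x}_N, \mathbf{z}_1, \ldots, \mathbf{z}_N) = p(\mathbf{z}_1) \left[ \prod_{n=2}^{N} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \right] \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{z}_n). \tag{11.48}$$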


Figure 11.31 A state-space model expresses the joint
probability distribution over a sequence
of observed states $\mathbf{x}_1, \ldots, \mathbf{x}_N$ in terms
of a Markov chain of hidden states
$\mathbf{z}_1, \ldots, \mathbf{z}_N$ in the form (11.48).

Using the d-separation criterion, we see that in the state-space model there
is always a path connecting any two observed variables $\mathbf{x}_n$ and $\mathbf{x}_m$ via the
latent variables and that this path is never blocked. Thus, the predictive distribution
$p(\mathbf{x}_{n+1} \mid \mathbf{x}_1, \ldots, \mathbf{x}_n)$ for observation $\mathbf{x}_{n+1}$ given all previous observations does not
exhibit any conditional independence properties, and so our predictions for $\mathbf{x}_{n+1}$
depend on all previous observations. The observed variables, therefore, do not satisfy
the Markov property at any order.
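
This dependence on the whole history can be seen numerically. The sketch below assumes a hypothetical state-space model with two discrete hidden states and binary observations, with all probabilities invented for illustration. It evaluates the predictive distribution using the standard forward recursion for such models, which is not derived in this section. The names `initial`, `transition`, `emission`, and `predictive` are illustrative.

```python
# Hypothetical model: two hidden states, binary observations.
initial = [0.5, 0.5]                      # p(z_1)
transition = [[0.9, 0.1], [0.1, 0.9]]     # transition[i][j] = p(z_n = j | z_{n-1} = i)
emission = [[0.8, 0.2], [0.3, 0.7]]       # emission[k][x] = p(x_n = x | z_n = k)


def predictive(observations):
    """Return p(x_{n+1} | x_1, ..., x_n) as a list over the values of x_{n+1}."""
    num_states = len(initial)
    # alpha[k] is proportional to p(z_n = k | x_1, ..., x_n).
    alpha = [initial[k] * emission[k][observations[0]] for k in range(num_states)]
    for x in observations[1:]:
        alpha = [
            emission[k][x] * sum(alpha[j] * transition[j][k] for j in range(num_states))
            for k in range(num_states)
        ]
    total = sum(alpha)
    posterior = [a / total for a in alpha]
    # Propagate one step along the hidden chain, then through the emission model.
    next_state = [
        sum(posterior[j] * transition[j][k] for j in range(num_states))
        for k in range(num_states)
    ]
    return [
        sum(next_state[k] * emission[k][x] for k in range(num_states))
        for x in range(len(emission[0]))
    ]


# The two histories differ only in the first observation.
p_a = predictive([0, 1, 1])
p_b = predictive([1, 1, 1])
print(p_a, p_b)
```

In this sketch the two predictive distributions differ even though the two most recent observations agree, which is the behaviour that a Markov model of first or second order over the observations could not reproduce.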

There are two important models for sequential data that are described by this 
graph. If the latent variables are discrete, then we obtain a hidden Markov model 
(Elliott, Aggoun, and Moore, 1995). Note that the observed variables in a hidden 
Markov model may be discrete or continuous, and a variety of different conditional 
distributions can be used to model them. If both the latent and the observed variables 
are Gaussian (with a linear-Gaussian dependence of the conditional distributions on 
their parents), then we obtain a linear dynamical system, also known as a Kalman 
filter (Zarchan and Musoff, 2005). Both hidden Markov models and Kalman filters 
are discussed at length, along with algorithms for training them, in Bishop (2006). 
Such models can be made considerably more flexible by replacing the simple discrete
probability tables, or linear-Gaussian distributions, used to define $p(\mathbf{x}_n \mid \mathbf{z}_n)$ with
deep neural networks.


11.1 (⋆) By marginalizing out the variables in order, show that the representation (11.6)
for the joint distribution of a directed graph is correctly normalized, provided each
of the conditional distributions is normalized.


11.2 (⋆) Show that the property of there being no directed cycles in a directed graph
follows from the statement that there exists an ordered numbering of the nodes such
that for each node there are no links going to a lower-numbered node.


11.3 (⋆⋆) Consider three binary variables $a, b, c \in \{0, 1\}$ having the joint distribution
given in Table 11.1. Show by direct evaluation that this distribution has the property
that $a$ and $b$ are marginally dependent, so that $p(a, b) \neq p(a)p(b)$, but that they
become independent when conditioned on $c$, so that $p(a, b \mid c) = p(a \mid c)p(b \mid c)$ for
both $c = 0$ and $c = 1$.


Table 11.1 The joint distribution over three binary variables.

a  b  c  p(a, b, c)
0  0  0  0.192
0  0  1  0.144
0  1  0  0.048
0  1  1  0.216
1  0  0  0.192
1  0  1  0.064
1  1  0  0.048
1  1  1  0.096


11.4 (⋆⋆) Evaluate the distributions $p(a)$, $p(b \mid c)$, and $p(c \mid a)$ corresponding to the joint
distribution given in Table 11.1. Hence, show by direct evaluation that $p(a, b, c) =
p(a)p(c \mid a)p(b \mid c)$. Draw the corresponding directed graph.


11.5 (⋆) For the model shown in Figure 11.6, we have seen that the number of parameters
required to specify the conditional distribution $p(y \mid x_1, \ldots, x_M)$, where $x_i \in \{0, 1\}$,
could be reduced from $2^M$ to $M + 1$ by making use of the logistic sigmoid
representation (11.8). An alternative representation (Pearl, 1988) is given by

$$p(y = 1 \mid x_1, \ldots, x_M) = 1 - (1 - \mu_0) \prod_{i=1}^{M} (1 - \mu_i)^{x_i} \tag{11.49}$$

where the parameters $\mu_i$ represent the probabilities $p(x_i = 1)$ and $\mu_0$ is an additional
parameter satisfying $0 \leq \mu_0 \leq 1$. The conditional distribution (11.49) is known as
the noisy-OR. Show that this can be interpreted as a ‘soft’ (probabilistic) form of the
logical OR function (i.e., the function that gives $y = 1$ whenever at least one of the
$x_i = 1$). Discuss the interpretation of $\mu_0$.


11.6 (⋆⋆) Starting from the definition (11.9) for the conditional distributions, derive the
recursion relation (11.12) for the mean of the joint distribution for a linear-Gaussian
model.


11.7 (⋆⋆) Starting from the definition (11.9) for the conditional distributions, derive the
recursion relation (11.13) for the covariance matrix of the joint distribution for a
linear-Gaussian model.


11.8 (⋆⋆) Show that the number of parameters in the covariance matrix of a fully
connected linear-Gaussian graphical model over $D$ variables defined by (11.9) is
$D(D + 1)/2$.


11.9 (⋆⋆) Using the recursion relations (11.12) and (11.13), show that the mean and
covariance of the joint distribution for the graph shown in Figure 11.7 are given by
(11.14) and (11.15), respectively.


11.10 (⋆) Verify that the joint distribution over a set of vector-valued variables defined by a
linear-Gaussian model in which each node corresponds to a distribution of the form
(11.16) is itself a Gaussian.


11.11 (⋆) Show that $a \perp\!\!\!\perp b, c \mid d$ implies $a \perp\!\!\!\perp b \mid d$.


11.12 (⋆) Using the d-separation criterion, show that the conditional distribution for a node
$x$ in a directed graph, conditioned on all the nodes in the Markov blanket, is
independent of the remaining variables in the graph.


11.13 (⋆) Consider the directed graph shown in Figure 11.32 in which none of the variables
is observed. Show that $a \perp\!\!\!\perp b \mid \emptyset$. Suppose we now observe the variable $d$. Show
that in general $a \not\perp\!\!\!\perp b \mid d$.


Figure 11.32 Example of a graphical model used to explore the conditional
independence properties of the head-to-head path a-c-b when
a descendant of c, namely the node d, is observed.


11.14 (⋆⋆) Consider the example of the car fuel system shown in Figure 11.20, and suppose
that instead of observing the state of the fuel gauge $G$ directly, the gauge is seen by
the driver $D$, who reports to us the reading on the gauge. This report says that the
gauge shows either that the tank is full, $D = 1$, or that it is empty, $D = 0$. Our driver
is a bit unreliable, as expressed through the following probabilities:

$$p(D = 1 \mid G = 1) = 0.9 \tag{11.50}$$

$$p(D = 0 \mid G = 0) = 0.9. \tag{11.51}$$


Suppose that the driver tells us that the fuel gauge shows empty, in other words 
that we observe D = 0. Evaluate the probability that the tank is empty given only 
this observation. Similarly, evaluate the corresponding probability given also the 
observation that the battery is flat, and note that this second probability is lower. 
Discuss the intuition behind this result, and relate the result to Figure 11.32. 


11.15 (⋆⋆) Suppose we train a naive Bayes model, with the assumption (11.37), using
maximum likelihood. Assume that each of the class-conditional densities $p(\mathbf{x}^{(l)} \mid \mathcal{C}_k)$
is governed by its own independent parameters $\mathbf{w}^{(l)}$. Show that the maximum
likelihood solution involves fitting each of the class-conditional densities using the
corresponding observed data vectors $\mathbf{x}_1^{(l)}, \ldots, \mathbf{x}_N^{(l)}$ by maximizing the likelihood with
respect to the corresponding class label data, and then setting the class priors $p(\mathcal{C}_k)$
to the fraction of training data points in each class.


11.16 (⋆⋆) Consider the joint probability distribution (11.44) corresponding to the directed
graph of Figure 11.29. Using the sum and product rules of probability, verify that
this joint distribution satisfies the conditional independence property (11.45) for $n =
2, \ldots, N$. Similarly, show that the second-order Markov model described by the joint
distribution (11.46) satisfies the conditional independence property

$$p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1}) = p(\mathbf{x}_n \mid \mathbf{x}_{n-1}, \mathbf{x}_{n-2}) \tag{11.52}$$

for $n = 3, \ldots, N$.


11.17 (⋆) Use d-separation, as discussed in Section 11.2, to verify that the Markov model
shown in Figure 11.29 having N nodes in total satisfies the conditional independence 
properties (11.45) for n = 2,..., N. Similarly, show that a model described by the 
graph in Figure 11.30 in which there are N nodes in total satisfies the conditional 
independence properties (11.52) for n = 3,..., N. 


11.18 (⋆) Consider a second-order Markov process described by the graph in Figure 11.30.
By combining adjacent pairs of variables, show that this can be expressed as a
first-order Markov process over the new variables.


11.19 (⋆) By using d-separation, show that the distribution $p(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ of the observed
data for the state-space model represented by the directed graph in Figure 11.31 does
not satisfy any conditional independence properties and hence does not exhibit the
Markov property at any finite order.



12 


Transformers 


Transformers represent one of the most important developments in deep learning. 
They are based on a processing concept called attention, which allows a network to 
give different weights to different inputs, with weighting coefficients that themselves 
depend on the input values, thereby capturing powerful inductive biases related to 
sequential and other forms of data. 

These models are known as transformers because they transform a set of vec- 
tors in some representation space into a corresponding set of vectors, having the 
same dimensionality, in some new space. The goal of the transformation is that the 
new space will have a richer internal representation that is better suited to solving 
downstream tasks. Inputs to a transformer can take the form of unstructured sets 
of vectors, ordered sequences, or more general representations, giving transformers 
broad applicability. 

Transformers were originally introduced in the context of natural language processing,
or NLP (where a ‘natural’ language is one such as English or Mandarin) and
have greatly surpassed the previous state-of-the-art approaches based on recurrent 
neural networks (RNNs). Transformers have subsequently been found to achieve 
excellent results in many other domains. For example, vision transformers often 
outperform CNNs in image processing tasks, whereas multimodal transformers that 
combine multiple types of data, such as text, images, audio, and video, are amongst 
the most powerful deep learning models. 

One major advantage of transformers is that transfer learning is very effective, so 
that a transformer model can be trained on a large body of data and then the trained 
model can be applied to many downstream tasks using some form of fine-tuning. A 
large-scale model that can subsequently be adapted to solve multiple different tasks 
is known as a foundation model. Furthermore, transformers can be trained in a self- 
supervised way using unlabelled data, which is especially effective with language 
models since transformers can exploit vast quantities of text available from the inter- 
net and other sources. The scaling hypothesis asserts that simply by increasing the 
scale of the model, as measured by the number of learnable parameters, and train- 
ing on a commensurately large data set, significant improvements in performance 
can be achieved, even with no architectural changes. Moreover, the transformer is
especially well suited to massively parallel processing hardware such as graphics
processing units, or GPUs, allowing exceptionally large neural network language
models having of the order of a trillion ($10^{12}$) parameters to be trained in reasonable
time. Such models have extraordinary capabilities and show clear indications
of emergent properties that have been described as the early signs of artificial general 
intelligence (Bubeck et al., 2023). 

The architecture of a transformer can seem complex, or even daunting, to a 
newcomer as it involves multiple different components working together, in which 
the various design choices can seem arbitrary. In this chapter we therefore aim to give 
a comprehensive step-by-step introduction to all the key ideas behind transformers 
and to provide clear intuition to motivate the design of the various elements. We first 
describe the transformer architecture and then focus on natural language processing, 
before exploring other application domains. 


12.1 Attention


The fundamental concept that underpins a transformer is attention. This was orig- 
inally developed as an enhancement to RNNs for machine translation (Bahdanau, 
Cho, and Bengio, 2014). However, Vaswani et al. (2017) later showed that signifi- 
cantly improved performance could be obtained by eliminating the recurrence struc- 
ture and instead focusing exclusively on the attention mechanism. Today, transform- 
ers based on attention have completely superseded RNNs in almost all applications. 

Figure 12.1 Schematic illustration of attention in which the interpretation of the word ‘bank’ is influenced by the
words ‘river’ and ‘swam’, with the thickness of each line being indicative of the strength of its influence.

We will motivate the use of attention using natural language as an example,
although it has much broader applicability. Consider the following two sentences:


I swam across the river to get to the other bank.
I walked across the road to get cash from the bank.


Here the word ‘bank’ has different meanings in the two sentences. However, this 
can be detected only by looking at the context provided by other words in the se- 
quence. We also see that some words are more important than others in determining 
the interpretation of ‘bank’. In the first sentence, the words ‘swam’ and ‘river’ most 
strongly indicate that ‘bank’ refers to the side of a river, whereas in the second sen- 
tence, the word ‘cash’ is a strong indicator that ‘bank’ refers to a financial institution. 
We see that to determine the appropriate interpretation of ‘bank’, a neural network 
processing such a sentence should attend to, in other words rely more heavily on, 
specific words from the rest of the sequence. This concept of attention is illustrated 
in Figure 12.1. 

Moreover, we also see that the particular locations that should receive more 
attention depend on the input sequence itself: in the first sentence it is the second and 
fifth words that are important whereas in the second sentence it is the eighth word. 
In a standard neural network, different inputs will influence the output to different 
extents according to the values of the weights that multiply those inputs. Once the 
network is trained, however, those weights, and their associated inputs, are fixed. 
By contrast, attention uses weighting factors whose values depend on the specific 
input data. Figure 12.2 shows the attention weights from a section of a transformer 
network trained on natural language. 

When we discuss natural language processing, we will see how word embed- 
ding can be used to map words into vectors in an embedding space. These vectors 
can then be used as inputs for subsequent neural network processing. These embed- 
dings capture elementary semantic properties, for example by mapping words with 
similar meanings to nearby locations in the embedding space. One characteristic of 
such embeddings is that a given word always maps to the same embedding vector. 

Figure 12.2 An example of learned attention weights. [From Vaswani et al. (2017) with permission.]


A transformer can be viewed as a richer form of embedding in which a given vector 
is mapped to a location that depends on the other vectors in the sequence. Thus, 
the vector representing ‘bank’ in our example above could map to different places 
in a new embedding space for the two different sentences. For example, in the first 
sentence the transformed representation might put ‘bank’ close to ‘water’ in the em- 
bedding space, whereas in the second sentence the transformed representation might 
put it close to ‘money’. 

As an example of attention, consider the modelling of proteins. We can view 
a protein as a one-dimensional sequence of molecular units called amino acids. A 
protein can comprise potentially hundreds or thousands of such units, each of which 
is given by one of 22 possibilities. In a living cell, a protein folds up into a three- 
dimensional structure in which amino acids that are widely separated in the one- 
dimensional sequence can become physically close in three-dimensional space and 
thereby interact. Transformer models allow these distant amino acids to ‘attend’ to
each other, thereby greatly improving the accuracy with which their three-dimensional
structure can be modelled (Vig et al., 2020).


12.1.1 Transformer processing 


The input data to a transformer is a set of vectors $\{\mathbf{x}_n\}$ of dimensionality $D$,
where $n = 1, \ldots, N$. We refer to these data vectors as tokens, where a token might,
for example, correspond to a word within a sentence, a patch within an image, or
an amino acid within a protein. The elements $x_{ni}$ of the tokens are called features.
Later we will see how to construct these token vectors for natural language data and 
for images. A powerful property of transformers is that we do not have to design a 
new neural network architecture to handle a mix of different data types but instead 
can simply combine the data variables into a joint set of tokens. 

Figure 12.3 The structure of the data matrix $\mathbf{X}$, of dimension $N \times D$, in which
row $n$ represents the transposed data vector $\mathbf{x}_n^{\mathrm{T}}$. [Axis label: D (features).]

Before we can gain a clear understanding of the operation of a transformer, it
is important to be precise about notation. We will follow the standard convention
and combine the data vectors into a matrix $\mathbf{X}$ of dimensions $N \times D$ in which the
$n$th row comprises the token vector $\mathbf{x}_n^{\mathrm{T}}$, and where $n = 1, \ldots, N$ labels the rows,
as illustrated in Figure 12.3. Note that this matrix represents one set of input tokens,
and that for most applications, we will require a data set containing many sets of
tokens, such as independent passages of text where each word is represented as one
token. The fundamental building block of a transformer is a function that takes a
data matrix as input and creates a transformed matrix $\widetilde{\mathbf{X}}$ of the same dimensionality
as the output. We can write this function in the form

$$\widetilde{\mathbf{X}} = \mathrm{TransformerLayer}[\mathbf{X}]. \tag{12.1}$$


We can then apply multiple transformer layers in succession to construct deep
networks capable of learning rich internal representations. Each transformer layer
contains its own weights and biases, which can be learned by gradient descent with
an appropriate cost function, as we will discuss in detail later in the chapter.

A single transformer layer itself comprises two stages. The first stage, which im- 
plements the attention mechanism, mixes together the corresponding features from 
different token vectors across the columns of the data matrix, whereas the second 
stage then acts on each row independently and transforms the features within each 
token vector. We start by looking at the attention mechanism. 


12.1.2 Attention coefficients 


Suppose that we have a set of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_N$ in an embedding space
and we want to map this to another set $\mathbf{y}_1, \ldots, \mathbf{y}_N$ having the same number of tokens
but in a new embedding space that captures a richer semantic structure. Consider a
particular output vector $\mathbf{y}_n$. The value of $\mathbf{y}_n$ should depend not just on the
corresponding input vector $\mathbf{x}_n$ but on all the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ in the set. With attention,
this dependence should be stronger for those inputs $\mathbf{x}_m$ that are particularly
important for determining the modified representation of $\mathbf{y}_n$. A simple way to achieve this
is to define each output vector $\mathbf{y}_n$ to be a linear combination of the input vectors
$\mathbf{x}_1, \ldots, \mathbf{x}_N$ with weighting coefficients $a_{nm}$:

$$\mathbf{y}_n = \sum_{m=1}^{N} a_{nm} \mathbf{x}_m \tag{12.2}$$


where $a_{nm}$ are called attention weights. The coefficients should be close to zero for
input tokens that have little influence on the output $\mathbf{y}_n$ and largest for inputs that
have most influence. We therefore constrain the coefficients to be non-negative to
avoid situations in which one coefficient can become large and positive while another
coefficient compensates by becoming large and negative. We also want to ensure that
if an output pays more attention to a particular input, this will be at the expense of
paying less attention to the other inputs, and so we constrain the coefficients to sum
to unity. Thus, the weighting coefficients must satisfy the following two constraints:

$$a_{nm} \geq 0 \tag{12.3}$$

$$\sum_{m=1}^{N} a_{nm} = 1. \tag{12.4}$$


Together these imply that each coefficient lies in the range $0 \leq a_{nm} \leq 1$ and so the
coefficients define a ‘partition of unity’. For the special case $a_{mm} = 1$, it follows that
$a_{mn} = 0$ for $n \neq m$, and therefore $\mathbf{y}_m = \mathbf{x}_m$ so that the input vector is unchanged
by the transformation. More generally, the output $\mathbf{y}_m$ is a blend of the input vectors
with some inputs given more weight than others.

Note that we have a different set of coefficients for each output vector $\mathbf{y}_n$, and
the constraints (12.3) and (12.4) apply separately for each value of $n$. These
coefficients $a_{nm}$ depend on the input data, and we will shortly see how to calculate
them.
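
As a small illustration of (12.2) together with the constraints (12.3) and (12.4), the sketch below applies a hand-picked matrix of attention weights to three two-dimensional tokens. The numerical values and the names `X`, `A`, and `Y` are assumptions made for the example, and NumPy is used purely for convenience.

```python
import numpy as np

# N = 3 tokens with D = 2 features each; row n is the token x_n.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical attention weights; row n holds a_n1, ..., a_nN.
A = np.array([[1.00, 0.00, 0.00],    # a_11 = 1, so output 1 copies input 1
              [0.10, 0.60, 0.30],
              [0.25, 0.25, 0.50]])

nonnegative = bool(np.all(A >= 0))                    # constraint (12.3)
rows_sum_to_one = bool(np.allclose(A.sum(axis=1), 1.0))   # constraint (12.4)

# Row n of Y is the weighted combination sum_m a_nm x_m of (12.2).
Y = A @ X
print(nonnegative, rows_sum_to_one)
print(Y)
```

Because each row of `A` is non-negative and sums to one, every output row is a weighted average of the input rows, and the first output reproduces the first input exactly.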


12.1.3 Self-attention 


The next question is how to determine the coefficients $a_{nm}$. Before we discuss
this in detail, it is useful to first introduce some terminology taken from the field of 
information retrieval. Consider the problem of choosing which movie to watch in 
an online movie streaming service. One approach would be to associate each movie 
with a list of attributes describing things such as the genre (comedy, action, etc.), the 
names of the leading actors, the length of the movie, and so on. The user could then 
search through a catalogue to find a movie that matches their preferences. We could 
automate this by encoding the attributes of each movie in a vector called the key. 
The corresponding movie file itself is called a value. Similarly, the user could then 
provide their own personal vector of values for the desired attributes, which we call 
the query. The movie service could then compare the query vector with all the key 
vectors to find the best match and send the corresponding movie to the user in the 
form of the value file. We can think of the user ‘attending’ to the particular movie 
whose key most closely matches their query. This would be considered a form of 
hard attention in which a single value vector is returned. For the transformer, we 
generalize this to soft attention in which we use continuous variables to measure
the degree of match between queries and keys and we then use these variables to
weight the influence of the value vectors on the outputs. This will also ensure that
the transformer function is differentiable and can therefore be trained by gradient
descent.

[Margin references: Section 5.3; Exercise 12.3.]
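
The contrast between hard and soft attention in the movie analogy can be sketched as follows. The keys, values, and query are invented for illustration, the match is measured by a dot product, and the soft weights are normalized with a softmax; both of these are assumptions of the sketch rather than something fixed by the analogy itself.

```python
import numpy as np

keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])                 # one key vector per item
values = np.array([[10.0], [20.0], [30.0]])   # one value vector per item
query = np.array([1.0, 0.2])

# Degree of match between the query and each key.
scores = keys @ query

# Hard attention: return the single value whose key matches best.
hard_output = values[np.argmax(scores)]

# Soft attention: weight every value by a normalized, differentiable score.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_output = weights @ values

print(hard_output, soft_output)
```

The hard output changes discontinuously when the best-matching key changes, whereas the soft output varies smoothly with the query and the keys, which is what makes gradient-based training possible.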

Following the analogy with information retrieval, we can view each of the input
vectors $\mathbf{x}_n$ as a value vector that will be used to create the output tokens. We also use
the vector $\mathbf{x}_n$ directly as the key vector for input token $n$. That would be analogous
to using the movie itself to summarize the characteristics of the movie. Finally, we
can use $\mathbf{x}_m$ as the query vector for output $\mathbf{y}_m$, which can then be compared to each
of the key vectors. To see how much the token represented by $\mathbf{x}_n$ should attend to
the token represented by $\mathbf{x}_m$, we need to work out how similar these vectors are.
One simple measure of similarity is to take their dot product $\mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m$. To impose the
constraints (12.3) and (12.4), we can define the weighting coefficients $a_{nm}$ by using
the softmax function to transform the dot products:

$$a_{nm} = \frac{\exp(\mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m)}{\sum_{m'=1}^{N} \exp(\mathbf{x}_n^{\mathrm{T}} \mathbf{x}_{m'})}. \tag{12.5}$$

Note that in this case there is no probabilistic interpretation of the softmax function 
and it is simply being used to normalize the attention weights appropriately. 

So in summary, each input vector $\mathbf{x}_n$ is transformed to a corresponding output
vector $\mathbf{y}_n$ by taking a linear combination of input vectors of the form (12.2) in which
the weight $a_{nm}$ applied to input vector $\mathbf{x}_m$ is given by the softmax function (12.5)
defined in terms of the dot product $\mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m$ between the query $\mathbf{x}_n$ for input $n$ and the
key $\mathbf{x}_m$ associated with input $m$. Note that, if all the input vectors are orthogonal,
then each output vector is simply equal to the corresponding input vector so that
$\mathbf{y}_m = \mathbf{x}_m$ for $m = 1, \ldots, N$.

We can write (12.2) in matrix notation by using the data matrix $\mathbf{X}$, along with
the analogous $N \times D$ output matrix $\mathbf{Y}$, whose rows are given by $\mathbf{y}_m^{\mathrm{T}}$, so that

$$\mathbf{Y} = \mathrm{Softmax}\left[\mathbf{X}\mathbf{X}^{\mathrm{T}}\right] \mathbf{X} \tag{12.6}$$


where Softmax[L] is an operator that takes the exponential of every element of a 
matrix L and then normalizes each row independently to sum to one. From now on, 
we will focus on matrix notation for clarity. 
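
The matrix form (12.6) translates almost line for line into array code. The following is a minimal NumPy sketch, in which the function names `softmax_rows` and `self_attention` are hypothetical and the input tokens are random numbers chosen only to exercise the code. Subtracting the row maximum before exponentiating is a common numerical safeguard that leaves the normalized result unchanged; it is an addition of this sketch and is not part of (12.6).

```python
import numpy as np


def softmax_rows(L):
    """Exponentiate every element of L and normalize each row to sum to one."""
    L = L - L.max(axis=1, keepdims=True)   # stability shift; the result is unchanged
    E = np.exp(L)
    return E / E.sum(axis=1, keepdims=True)


def self_attention(X):
    """Parameter-free dot-product self-attention, Y = Softmax[X X^T] X."""
    A = softmax_rows(X @ X.T)   # (N, N) matrix of attention weights a_nm
    return A @ X, A


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))     # N = 4 tokens, D = 3 features
Y, A = self_attention(X)
print(Y.shape, A.sum(axis=1))
```

The output has the same shape as the input, and each row of `A` satisfies the constraints (12.3) and (12.4) by construction.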

This process is called self-attention because we are using the same sequence to 
determine the queries, keys, and values. We will encounter variants of this attention 
mechanism later in this chapter. Also, because the measure of similarity between 
query and key vectors is given by a dot product, this is known as dot-product self- 
attention. 


12.1.4 Network parameters 


As it stands, the transformation from input vectors $\{\mathbf{x}_n\}$ to output vectors $\{\mathbf{y}_n\}$
is fixed and has no capacity to learn from data because it has no adjustable
parameters. Furthermore, each of the feature values within a token vector $\mathbf{x}_n$ plays an equal
role in determining the attention coefficients, whereas we would like the network to
have the flexibility to focus more on some features than others when determining
token similarity. We can address both issues if we define modified feature vectors
given by a linear transformation of the original vectors in the form

$$\widetilde{\mathbf{X}} = \mathbf{X}\mathbf{U} \tag{12.7}$$

where $\mathbf{U}$ is a $D \times D$ matrix of learnable weight parameters, analogous to a ‘layer’
in a standard neural network. This gives a modified transformation of the form

$$\mathbf{Y} = \mathrm{Softmax}\left[\mathbf{X}\mathbf{U}\mathbf{U}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\right] \mathbf{X}\mathbf{U}. \tag{12.8}$$

Although this has much more flexibility, it has the property that the matrix

$$\mathbf{X}\mathbf{U}\mathbf{U}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}} \tag{12.9}$$


is symmetric, whereas we would like the attention mechanism to support significant 
asymmetry. For example, we might expect that ‘chisel’ should be strongly associ- 
ated with ‘tool’ since every chisel is a tool, whereas ‘tool’ should only be weakly 
associated with ‘chisel’ because there are many other kinds of tools besides chis- 
els. Although the softmax function means the resulting matrix of attention weights 
is not itself symmetric, we can create a much more flexible model by allowing the 
queries and the keys to have independent parameters. Furthermore, the form (12.8) 
uses the same parameter matrix U to define both the value vectors and the attention 
coefficients, which again seems like an undesirable restriction. 
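
The symmetry of the matrix (12.9), and the fact that the row-wise softmax nevertheless produces asymmetric attention weights, can both be confirmed on a small example. The values of `X` and `U` below are arbitrary choices made for this sketch.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])        # N = 3 tokens, D = 2 features
U = np.array([[1.0, 0.5],
              [0.0, 1.0]])        # an arbitrary D x D weight matrix

S = X @ U @ U.T @ X.T             # the matrix (12.9)

# Row-wise softmax of S gives the attention weights used in (12.8).
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

print(np.allclose(S, S.T))        # the scores before the softmax are symmetric
print(np.allclose(A, A.T))        # the normalized weights are not
```

The asymmetry in `A` arises only because each row is divided by a different normalizing constant, so it cannot be adjusted independently of the scores, which is one way to see why separate query and key parameters give a more flexible model.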

We can overcome these limitations by defining separate query, key, and value 
matrices each having their own independent linear transformations: 


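$$\mathbf{Q} = \mathbf{X}\mathbf{W}^{(\mathrm{q})} \tag{12.10}$$

$$\mathbf{K} = \mathbf{X}\mathbf{W}^{(\mathrm{k})} \tag{12.11}$$

$$\mathbf{V} = \mathbf{X}\mathbf{W}^{(\mathrm{v})} \tag{12.12}$$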


where the weight matrices $\mathbf{W}^{(\mathrm{q})}$, $\mathbf{W}^{(\mathrm{k})}$, and $\mathbf{W}^{(\mathrm{v})}$ represent parameters that will
be learned during the training of the final transformer architecture. Here the matrix
$\mathbf{W}^{(\mathrm{k})}$ has dimensionality $D \times D_{\mathrm{k}}$ where $D_{\mathrm{k}}$ is the length of the key vector. The
matrix $\mathbf{W}^{(\mathrm{q})}$ must have the same dimensionality $D \times D_{\mathrm{k}}$ as $\mathbf{W}^{(\mathrm{k})}$ so that we can
form dot products between the query and key vectors. A typical choice is $D_{\mathrm{k}} = D$.
Similarly, $\mathbf{W}^{(\mathrm{v})}$ is a matrix of size $D \times D_{\mathrm{v}}$, where $D_{\mathrm{v}}$ governs the dimensionality of
the output vectors. If we set $D_{\mathrm{v}} = D$, so that the output representation has the same
dimensionality as the input, this will facilitate the inclusion of residual connections,
which we discuss later. Also, multiple transformer layers can be stacked on top of
each other if each layer has the same dimensionality. We can then generalize (12.6)
to give

$$\mathbf{Y} = \mathrm{Softmax}\left[\mathbf{Q}\mathbf{K}^{\mathrm{T}}\right] \mathbf{V} \tag{12.13}$$


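where $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$ has dimension $N \times N$, and the matrix $\mathbf{Y}$ has dimension $N \times D_{\mathrm{v}}$. The
calculation of the matrix $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$ is illustrated in Figure 12.4, whereas the evaluation
of the matrix $\mathbf{Y}$ is illustrated in Figure 12.5.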
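
Equations (12.10) to (12.13) can be sketched in a few lines of NumPy. The function names `softmax_rows` and `attention`, the variable names `W_q`, `W_k`, and `W_v`, and the random initial values are all assumptions of this example; in a real network the three weight matrices would be learned during training. No scaling of the softmax argument is applied, matching the form (12.13).

```python
import numpy as np


def softmax_rows(L):
    """Exponentiate every element of L and normalize each row to sum to one."""
    L = L - L.max(axis=1, keepdims=True)   # stability shift; the result is unchanged
    E = np.exp(L)
    return E / E.sum(axis=1, keepdims=True)


def attention(X, W_q, W_k, W_v):
    """Dot-product self-attention with separate query, key, and value weights."""
    Q = X @ W_q                            # (N, D_k), equation (12.10)
    K = X @ W_k                            # (N, D_k), equation (12.11)
    V = X @ W_v                            # (N, D_v), equation (12.12)
    return softmax_rows(Q @ K.T) @ V       # (N, D_v), equation (12.13)


rng = np.random.default_rng(0)
N, D, D_k, D_v = 5, 4, 4, 4
X = rng.normal(size=(N, D))
W_q = rng.normal(size=(D, D_k))
W_k = rng.normal(size=(D, D_k))
W_v = rng.normal(size=(D, D_v))

Y = attention(X, W_q, W_k, W_v)
print(Y.shape)
```

Setting all three weight matrices to the identity recovers the parameter-free form (12.6), which gives a convenient sanity check on an implementation.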

Figure 12.4 Illustration of the evaluation of the matrix $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$, which determines the attention
coefficients in a transformer. The input $\mathbf{X}$ is separately transformed using (12.10) and (12.11)
to give the query matrix $\mathbf{Q}$ and key matrix $\mathbf{K}$, respectively, which are then multiplied
together.


In practice we can also include bias parameters in these linear transformations. 
However, the bias parameters can be absorbed into the weight matrices, as we did 
with standard neural networks, by augmenting the data matrix X with an additional 
column of 1’s and by augmenting the weight matrices with an additional row of 
parameters to represent the biases. From now on we will treat the bias parameters as 
implicit to avoid cluttering the notation. 
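The absorption of the biases can be checked numerically. The following is a minimal sketch using NumPy; the sizes, the random values, and the variable names (`X_aug`, `W_aug`, `b`) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, D_k = 4, 3, 2                 # illustrative sizes (assumed)
X = rng.normal(size=(N, D))         # data matrix, one token per row
W = rng.normal(size=(D, D_k))       # weight matrix of a linear transformation
b = rng.normal(size=(D_k,))         # bias vector

# Explicit bias: b is added to every row of XW.
explicit = X @ W + b

# Absorbed bias: append a column of 1's to X and a row holding b to W.
X_aug = np.hstack([X, np.ones((N, 1))])   # shape (N, D + 1)
W_aug = np.vstack([W, b])                 # shape (D + 1, D_k)
absorbed = X_aug @ W_aug
```

Both routes give the same matrix, so the bias-free notation loses no generality.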

Compared to a conventional neural network, the signal paths have multiplicative 
relations between activation values. Whereas standard networks multiply activations 
by fixed weights, here the activations are multiplied by the data-dependent attention 
coefficients. This means, for example, that if one of the attention coefficients is 
close to zero for a particular choice of input vector, the resulting signal path will 
ignore the corresponding incoming signal, which will therefore have no influence
on the network outputs. By contrast, if a standard neural network learns to ignore a
particular input or hidden-unit variable, it does so for all input vectors.

Figure 12.5 Illustration of the evaluation of the output from an attention layer given
the query, key, and value matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, respectively. The entry at the
position highlighted in the output matrix $\mathbf{Y}$ is obtained from the dot product of the
highlighted row and column of the $\mathrm{Softmax}[\mathbf{Q}\mathbf{K}^{\mathrm{T}}]$ and $\mathbf{V}$ matrices, respectively.

Figure 12.6 Information flow in a scaled dot-product self-attention neural network
layer. Here ‘mat mul’ denotes matrix multiplication, and ‘scale’ refers to the
normalization of the argument to the softmax using $\sqrt{D_k}$. This structure
constitutes a single attention ‘head’.


12.1.5 Scaled self-attention 


There is one final refinement we can make to the self-attention layer. Recall that 
the gradients of the softmax function become exponentially small for inputs of high 
magnitude, just as happens with tanh or logistic-sigmoid activation functions. To 
help prevent this from happening, we can re-scale the product of the query and key 
vectors before applying the softmax function. To derive a suitable scaling, note that 
if the elements of the query and key vectors were all independent random numbers 
with zero mean and unit variance, then the variance of the dot product would be $D_k$.
We therefore normalize the argument to the softmax using the standard deviation
given by the square root of $D_k$, so that the output of the attention layer takes the
form

$$\mathbf{Y} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left[\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{D_k}}\right]\mathbf{V}. \tag{12.14}$$

This is called scaled dot-product self-attention, and is the final form of our self- 
attention neural network layer. The structure of this layer is summarized in Fig- 
ure 12.6 and in Algorithm 12.1. 
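A single attention head following (12.10)–(12.12) and (12.14) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function names, the sizes, and the randomly drawn weights are assumptions. The last lines check the scaling argument empirically by sampling query and key vectors with independent zero-mean, unit-variance elements.

```python
import numpy as np

def softmax(a, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Return the output Y and the attention weights A for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # (12.10)-(12.12)
    D_k = K.shape[1]
    A = softmax(Q @ K.T / np.sqrt(D_k))       # (N, N) matrix, each row sums to one
    return A @ V, A                           # (12.14)

rng = np.random.default_rng(0)
N, D, D_k, D_v = 5, 8, 8, 8                   # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W_q = rng.normal(size=(D, D_k))
W_k = rng.normal(size=(D, D_k))
W_v = rng.normal(size=(D, D_v))
Y, A = self_attention(X, W_q, W_k, W_v)

# Empirical check: the variance of q.k should be close to D_k.
q = rng.normal(size=(20000, D_k))
k = rng.normal(size=(20000, D_k))
dot_variance = (q * k).sum(axis=1).var()
```

Here `Y` has shape `(N, D_v)`, and `dot_variance` comes out close to `D_k`, which is why dividing by the square root of `D_k` brings the softmax arguments back to roughly unit variance.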


12.1.6 Multi-head attention 


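Algorithm 12.1: Scaled dot-product self-attention

Input: Set of tokens $\mathbf{X} \in \mathbb{R}^{N \times D}$ : $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
       Weight matrices $\{\mathbf{W}^{(q)}, \mathbf{W}^{(k)}\} \in \mathbb{R}^{D \times D_k}$ and $\mathbf{W}^{(v)} \in \mathbb{R}^{D \times D_v}$
Output: $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \in \mathbb{R}^{N \times D_v}$ : $\{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$

$\mathbf{Q} = \mathbf{X}\mathbf{W}^{(q)}$ // compute queries $\mathbf{Q} \in \mathbb{R}^{N \times D_k}$
$\mathbf{K} = \mathbf{X}\mathbf{W}^{(k)}$ // compute keys $\mathbf{K} \in \mathbb{R}^{N \times D_k}$
$\mathbf{V} = \mathbf{X}\mathbf{W}^{(v)}$ // compute values $\mathbf{V} \in \mathbb{R}^{N \times D_v}$
return $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left[\mathbf{Q}\mathbf{K}^{\mathrm{T}} / \sqrt{D_k}\right]\mathbf{V}$

The attention layer described so far allows the output vectors to attend to
data-dependent patterns of input vectors and is called an attention head. However, there
might be multiple patterns of attention that are relevant at the same time. In natural
language, for example, some patterns might be relevant to tense whereas others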
might be associated with vocabulary. Using a single attention head can lead to av- 
eraging over these effects. Instead we can use multiple attention heads in parallel. 
These consist of identically structured copies of the single head, with independent 
learnable parameters that govern the calculation of the query, key, and value matri- 
ces. This is analogous to using multiple different filters in each layer of a convolu- 
tional network. 
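Suppose we have $H$ heads indexed by $h = 1, \ldots, H$ of the form

$$\mathbf{H}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h) \tag{12.15}$$

where $\mathrm{Attention}(\cdot, \cdot, \cdot)$ is given by (12.14), and we have defined separate query, key,
and value matrices for each head using

$$\mathbf{Q}_h = \mathbf{X}\mathbf{W}_h^{(q)} \tag{12.16}$$
$$\mathbf{K}_h = \mathbf{X}\mathbf{W}_h^{(k)} \tag{12.17}$$
$$\mathbf{V}_h = \mathbf{X}\mathbf{W}_h^{(v)} \tag{12.18}$$

The heads are first concatenated into a single matrix, and the result is then linearly
transformed using a matrix $\mathbf{W}^{(o)}$ to give a combined output in the form

$$\mathbf{Y}(\mathbf{X}) = \mathrm{Concat}\left[\mathbf{H}_1, \ldots, \mathbf{H}_H\right]\mathbf{W}^{(o)}. \tag{12.19}$$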


This is illustrated in Figure 12.7. 

Each matrix $\mathbf{H}_h$ has dimension $N \times D_v$, and so the concatenated matrix has
dimension $N \times HD_v$. This is transformed by the linear matrix $\mathbf{W}^{(o)}$ of dimension
$HD_v \times D$ to give the final output matrix $\mathbf{Y}$ of dimension $N \times D$, which is the same
as the original input matrix $\mathbf{X}$. The elements of the matrix $\mathbf{W}^{(o)}$ are learned during
the training phase along with the query, key, and value matrices. Typically $D_v$ is
chosen to be equal to $D/H$ so that the resulting concatenated matrix has dimension
$N \times D$. Multi-head attention is summarized in Algorithm 12.2, and the information
flow in a multi-head attention layer is illustrated in Figure 12.8.

Figure 12.7 Multi-head attention. Each head comprises the structure shown in
Figure 12.6, and has its own key, query, and value parameters. The outputs of the
heads are concatenated and then linearly projected back to the input data
dimensionality.

Note that the formulation of multi-head attention given above, which follows
that used in the research literature, includes some redundancy in the successive
multiplication of the $\mathbf{W}^{(v)}$ matrix for each head and the output matrix $\mathbf{W}^{(o)}$. Removing
this redundancy allows a multi-head self-attention layer to be written as a sum over
contributions from each of the heads separately.
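The equivalence between the concatenated form (12.19) and a sum over heads can be checked numerically. In the sketch below, which assumes NumPy and uses hypothetical function names and randomly drawn weights, the rows of $\mathbf{W}^{(o)}$ that act on head $h$ are multiplied into that head's value matrix to give a single $D \times D$ matrix per head, and the head contributions are then added.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(X, W_q, W_k):
    """Softmax[Q K^T / sqrt(D_k)] for one head."""
    Q, K = X @ W_q, X @ W_k
    return softmax(Q @ K.T / np.sqrt(K.shape[1]))

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Concatenated form (12.19); W_q, W_k, W_v are lists with one matrix per head."""
    heads = [attention_weights(X, wq, wk) @ (X @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=1) @ W_o

def multi_head_as_sum(X, W_q, W_k, W_v, W_o):
    """Same output written as a sum of per-head contributions."""
    D_v = W_v[0].shape[1]
    Y = np.zeros((X.shape[0], W_o.shape[1]))
    for h, (wq, wk, wv) in enumerate(zip(W_q, W_k, W_v)):
        W_o_h = W_o[h * D_v:(h + 1) * D_v]    # rows of W_o acting on head h
        W_vo_h = wv @ W_o_h                   # one combined (D, D) matrix per head
        Y += attention_weights(X, wq, wk) @ X @ W_vo_h
    return Y

rng = np.random.default_rng(0)
N, D, H = 6, 8, 2                             # illustrative sizes (assumed)
D_k = D_v = D // H
X = rng.normal(size=(N, D))
W_q = [rng.normal(size=(D, D_k)) for _ in range(H)]
W_k = [rng.normal(size=(D, D_k)) for _ in range(H)]
W_v = [rng.normal(size=(D, D_v)) for _ in range(H)]
W_o = rng.normal(size=(H * D_v, D))

Y_concat = multi_head_attention(X, W_q, W_k, W_v, W_o)
Y_sum = multi_head_as_sum(X, W_q, W_k, W_v, W_o)
```

The two outputs agree to floating-point precision, and both have the same shape `(N, D)` as the input.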


12.1.7 Transformer layers 


Algorithm 12.2: Multi-head attention

Input: Set of tokens $\mathbf{X} \in \mathbb{R}^{N \times D}$ : $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
       Query weight matrices $\{\mathbf{W}_1^{(q)}, \ldots, \mathbf{W}_H^{(q)}\} \in \mathbb{R}^{D \times D}$
       Key weight matrices $\{\mathbf{W}_1^{(k)}, \ldots, \mathbf{W}_H^{(k)}\} \in \mathbb{R}^{D \times D}$
       Value weight matrices $\{\mathbf{W}_1^{(v)}, \ldots, \mathbf{W}_H^{(v)}\} \in \mathbb{R}^{D \times D_v}$
       Output weight matrix $\mathbf{W}^{(o)} \in \mathbb{R}^{HD_v \times D}$
Output: $\mathbf{Y} \in \mathbb{R}^{N \times D}$ : $\{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$

// compute self-attention for each head (Algorithm 12.1)
for $h = 1, \ldots, H$ do
    $\mathbf{Q}_h = \mathbf{X}\mathbf{W}_h^{(q)}$, $\mathbf{K}_h = \mathbf{X}\mathbf{W}_h^{(k)}$, $\mathbf{V}_h = \mathbf{X}\mathbf{W}_h^{(v)}$
    $\mathbf{H}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h)$ // $\mathbf{H}_h \in \mathbb{R}^{N \times D_v}$
end for
$\mathbf{H} = \mathrm{Concat}[\mathbf{H}_1, \ldots, \mathbf{H}_H]$ // concatenate heads
return $\mathbf{Y}(\mathbf{X}) = \mathbf{H}\mathbf{W}^{(o)}$

Figure 12.8 Information flow in a multi-head attention layer. The associated
computation, given by Algorithm 12.2, is illustrated in Figure 12.7.

Multi-head self-attention forms the core architectural element in a transformer
network. We know that neural networks benefit greatly from depth, and so we would
like to stack multiple self-attention layers on top of each other. To improve training
efficiency, we can introduce residual connections that bypass the multi-head
structure. To do this we require that the output dimensionality is the same as the input
dimensionality, namely $N \times D$. This is then followed by layer normalization (Ba,
Kiros, and Hinton, 2016), which improves training efficiency. The resulting
transformation can be written as


$$\mathbf{Z} = \mathrm{LayerNorm}\left[\mathbf{Y}(\mathbf{X}) + \mathbf{X}\right] \tag{12.20}$$

where $\mathbf{Y}$ is defined by (12.19). Sometimes the layer normalization is replaced by
pre-norm in which the normalization layer is applied before the multi-head
self-attention instead of after, as this can result in more effective optimization, in which
case we have

$$\mathbf{Z} = \mathbf{Y}(\mathbf{X}') + \mathbf{X}, \quad \text{where } \mathbf{X}' = \mathrm{LayerNorm}[\mathbf{X}]. \tag{12.21}$$


In each case, Z again has the same dimensionality N x D as the input matrix X. 
We have seen that the attention mechanism creates linear combinations of the 
value vectors, which are then linearly combined to produce the output vectors. Also, 
the values are linear functions of the input vectors, and so we see that the outputs 
of an attention layer are constrained to be linear combinations of the inputs. Non- 
linearity does enter through the attention weights, and so the outputs will depend 
nonlinearly on the inputs via the softmax function, but the output vectors are still 
constrained to lie in the subspace spanned by the input vectors and this limits the 
expressive capabilities of the attention layer. We can enhance the flexibility of the 
transformer by post-processing the output of each layer using a standard nonlinear 
neural network with D inputs and D outputs, denoted MLP[-] for ‘multilayer per- 
ceptron’. For example, this might consist of a two-layer fully connected network 
with ReLU hidden units. This needs to be done in a way that preserves the ability
of the transformer to process sequences of variable length. To achieve this, the same
shared network is applied to each of the output vectors, corresponding to the rows of
$\mathbf{Z}$. Again, this neural network layer can be improved by using a residual connection.
It also includes layer normalization so that the final output from the transformer layer
has the form

$$\widetilde{\mathbf{X}} = \mathrm{LayerNorm}\left[\mathrm{MLP}[\mathbf{Z}] + \mathbf{Z}\right]. \tag{12.22}$$

This leads to an overall architecture for a transformer layer shown in Figure 12.9 and
summarized in Algorithm 12.3. Again, we can use a pre-norm instead, in which case
the final output is given by

$$\widetilde{\mathbf{X}} = \mathrm{MLP}[\mathbf{Z}'] + \mathbf{Z}, \quad \text{where } \mathbf{Z}' = \mathrm{LayerNorm}[\mathbf{Z}]. \tag{12.23}$$

In a typical transformer there are multiple such layers stacked on top of each other.
The layers generally have identical structures, although there is no sharing of weights
and biases between different layers.

Figure 12.9 One layer of the transformer architecture that implements the
transformation (12.1). Here ‘MLP’ stands for multilayer perceptron, while ‘add and
norm’ denotes a residual connection followed by layer normalization.
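The post-norm and pre-norm variants of a transformer layer, (12.20)–(12.23), can be sketched as follows. Several simplifications here are assumptions made for brevity and are not taken from the text: a single attention head with $D_k = D_v = D$ stands in for the multi-head output $\mathbf{Y}(\mathbf{X})$, layer normalization omits its learnable scale and shift, the hidden layer of the MLP has $4D$ units, and the weights are random.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def self_attention(X, p):
    """Single-head stand-in for Y(X), with D_k = D_v = D."""
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

def mlp(Z, p):
    """Two-layer network with ReLU hidden units, shared across the rows of Z."""
    hidden = np.maximum(0.0, Z @ p["W_1"] + p["b_1"])
    return hidden @ p["W_2"] + p["b_2"]

def transformer_layer(X, p, pre_norm=False):
    if pre_norm:
        Z = self_attention(layer_norm(X), p) + X    # (12.21)
        return mlp(layer_norm(Z), p) + Z            # (12.23)
    Z = layer_norm(self_attention(X, p) + X)        # (12.20)
    return layer_norm(mlp(Z, p) + Z)                # (12.22)

def init_params(D, D_hidden, rng):
    """Random parameters for illustration only."""
    s = 1.0 / np.sqrt(D)
    return {
        "W_q": s * rng.normal(size=(D, D)),
        "W_k": s * rng.normal(size=(D, D)),
        "W_v": s * rng.normal(size=(D, D)),
        "W_1": s * rng.normal(size=(D, D_hidden)),
        "b_1": np.zeros(D_hidden),
        "W_2": rng.normal(size=(D_hidden, D)) / np.sqrt(D_hidden),
        "b_2": np.zeros(D),
    }

rng = np.random.default_rng(0)
D = 8
params = init_params(D, 4 * D, rng)
X_short = rng.normal(size=(3, D))     # a sequence of 3 tokens
X_long = rng.normal(size=(10, D))     # a sequence of 10 tokens
out_short = transformer_layer(X_short, params)
out_long = transformer_layer(X_long, params)
out_pre = transformer_layer(X_long, params, pre_norm=True)
```

The same parameters handle both sequence lengths, because every operation either acts on each row separately or, in the attention step, adapts to the number of rows.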


12.1.8 Computational complexity 


The attention layer discussed so far takes a set of N vectors each of length 
D and maps them into another set of N vectors having the same dimensionality. 
Thus, the inputs and outputs each have overall dimensionality N D. If we had used a 
standard fully connected neural network to map the input values to the output values, 
it would have $O(N^2 D^2)$ independent parameters. Likewise the computational cost
of evaluating one forward pass through such a network would also be $O(N^2 D^2)$.

Algorithm 12.3: Transformer layer

Input: Set of tokens $\mathbf{X} \in \mathbb{R}^{N \times D}$ : $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
       Multi-head self-attention layer parameters
       Feed-forward network parameters
Output: $\widetilde{\mathbf{X}} \in \mathbb{R}^{N \times D}$ : $\{\widetilde{\mathbf{x}}_1, \ldots, \widetilde{\mathbf{x}}_N\}$

$\mathbf{Z} = \mathrm{LayerNorm}[\mathbf{Y}(\mathbf{X}) + \mathbf{X}]$ // $\mathbf{Y}(\mathbf{X})$ from Algorithm 12.2
$\widetilde{\mathbf{X}} = \mathrm{LayerNorm}[\mathrm{MLP}[\mathbf{Z}] + \mathbf{Z}]$ // shared neural network
return $\widetilde{\mathbf{X}}$

In the attention layer, the matrices $\mathbf{W}^{(q)}$, $\mathbf{W}^{(k)}$, and $\mathbf{W}^{(v)}$ are shared across
input tokens, and therefore the number of independent parameters is $O(D^2)$, assuming
$D_k \simeq D_v \simeq D$. Since there are $N$ input tokens, the number of computational steps
in evaluating the dot products in a self-attention layer is $O(N^2 D)$. We can think
of a self-attention layer as a sparse matrix in which parameters are shared between
specific blocks of the matrix. The subsequent neural network layer, which has $D$
inputs and $D$ outputs, has a cost that is $O(D^2)$. Since it is shared across tokens, it
has a complexity that is linear in $N$, and therefore overall this layer has a cost that is
$O(N D^2)$. Depending on the relative sizes of $N$ and $D$, either the self-attention layer
or the MLP layer may dominate the computational cost. Compared to a fully
connected network, a transformer layer is computationally more efficient. Many
variants of the transformer architecture have been proposed (Lin et al., 2021; Phuong
and Hutter, 2022) including modifications aimed at improving efficiency (Tay et al.,
2020).
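To get a feel for these scalings, the leading-order counts can be evaluated for a concrete case. The functions below are a rough sketch: they keep only the leading term, drop constant factors and biases, and assume $D_k = D_v = D$; the values $N = 1024$ and $D = 512$ are arbitrary illustrative choices.

```python
def fully_connected_cost(N, D):
    """Weights (and multiply-adds) of a dense map from N*D inputs to N*D outputs."""
    return (N * D) ** 2

def attention_parameter_count(D):
    """Entries of W^(q), W^(k), and W^(v) when D_k = D_v = D."""
    return 3 * D * D

def attention_dot_product_cost(N, D):
    """Multiply-adds for Q K^T: N * N entries, each a dot product of length D."""
    return N * N * D

def mlp_cost(N, D):
    """Leading-order cost of a shared D-to-D network applied to N tokens."""
    return N * D * D

N, D = 1024, 512
dense = fully_connected_cost(N, D)
attention = attention_dot_product_cost(N, D)
mlp = mlp_cost(N, D)
```

With these values the dense network needs about $2.7 \times 10^{11}$ operations, whereas the dot products need about $5.4 \times 10^{8}$ and the shared network about $2.7 \times 10^{8}$. Because $N > D$ here, the self-attention term is the larger of the two.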


12.1.9 Positional encoding 


In the transformer architecture, the matrices $\mathbf{W}_h^{(q)}$, $\mathbf{W}_h^{(k)}$, and $\mathbf{W}_h^{(v)}$ are shared
across the input tokens, as is the subsequent neural network. As a consequence, the
transformer has the property that permuting the order of the input tokens, i.e., the
rows of $\mathbf{X}$, results in the same permutation of the rows of the output matrix $\widetilde{\mathbf{X}}$. In
other words a transformer is equivariant with respect to input permutations. The
sharing of parameters in the network architecture facilitates the massively parallel 
processing of the transformer, and also allows the network to learn long-range de- 
pendencies just as effectively as short-range dependencies. However, the lack of 
dependence on token order becomes a major limitation when we consider sequential 
data, such as the words in a natural language, because the representation learned by 
a transformer will be independent of the input token ordering. The two sentences 
‘The food was bad, not good at all.’ and ‘The food was good, not bad at all.’ con- 
tain the same tokens but they have very different meanings because of the different 
token ordering. Clearly token order is crucial for most sequential processing tasks 
including natural language processing, and so we need to find a way to inject token 
order information into the network. 
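Permutation equivariance is easy to confirm numerically for a self-attention layer without positional information. The sketch below assumes NumPy, a single attention head, and random weights; all names and sizes are illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

rng = np.random.default_rng(0)
N, D = 6, 4                                   # illustrative sizes (assumed)
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

perm = rng.permutation(N)                     # a random reordering of the tokens
Y = self_attention(X, W_q, W_k, W_v)
Y_perm = self_attention(X[perm], W_q, W_k, W_v)

# Permuting the input rows permutes the output rows in the same way.
is_equivariant = np.allclose(Y_perm, Y[perm])

# Any order-independent summary, such as the mean over tokens, is unchanged.
same_mean = np.allclose(Y_perm.mean(axis=0), Y.mean(axis=0))
```

Both checks evaluate to `True`, which shows that the layer by itself cannot distinguish between different orderings of the same tokens.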

Since we wish to retain the powerful properties of the attention layers that we
have carefully constructed, we aim to encode the token order in the data itself
instead of having it represented in the network architecture. We will therefore
construct a position encoding vector $\mathbf{r}_n$ associated with each input position $n$ and
then combine this with the associated input token embedding $\mathbf{x}_n$. One obvious way
to combine these vectors would be to concatenate them, but this would increase the
dimensionality of the input space and hence of all subsequent attention spaces,
creating a significant increase in computational cost. Instead, we can simply add the
position vectors onto the token vectors to give

$$\widetilde{\mathbf{x}}_n = \mathbf{x}_n + \mathbf{r}_n. \tag{12.24}$$


This requires that the positional encoding vectors have the same dimensionality as 
the token-embedding vectors. 

At first it might seem that adding position information onto the token vector 
would corrupt the input vectors and make the task of the network much more diffi- 
cult. However, some intuition as to why this can work well comes from noting that 
two randomly chosen uncorrelated vectors tend to be nearly orthogonal in spaces of 
high dimensionality, indicating that the network is able to process the token identity 
information and the position information relatively separately. Note also that, be- 
cause of the residual connections across every layer, the position information does 
not get lost in going from one transformer layer to the next. Moreover, due to the 
linear processing layers in the transformer, a concatenated representation has similar 
properties to an additive one. 
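The claim about near-orthogonality can be illustrated with a small experiment. The sketch below draws pairs of independent Gaussian vectors and reports the average absolute cosine of the angle between them; the Gaussian distribution, the number of pairs, and the dimensions compared are assumptions made for illustration.

```python
import numpy as np

def mean_abs_cosine(D, pairs=1000, seed=0):
    """Average |cos(angle)| between pairs of independent Gaussian vectors in D dimensions."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(pairs, D))
    b = rng.normal(size=(pairs, D))
    cosine = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cosine).mean()

low_dim = mean_abs_cosine(2)       # vectors in the plane are often far from orthogonal
high_dim = mean_abs_cosine(1024)   # vectors in 1024 dimensions are nearly orthogonal
```

The typical cosine shrinks roughly as $1/\sqrt{D}$, so in the high-dimensional case the value is only a few hundredths.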

The next task is to construct the embedding vectors $\{\mathbf{r}_n\}$. A simple approach
would be to associate an integer 1, 2,3, ... with each position. However, this has the 
problem that the magnitude of the value increases without bound and therefore may 
start to corrupt the embedding vector significantly. Also it may not generalize well 
to new input sequences that are longer than those used in training, since these will 
involve coding values that lie outside the range of those used in training. Alterna- 
tively we could assign a number in the range (0, 1) to each token in the sequence, 
which keeps the representation bounded. However, this representation is not unique 
for a given position as it depends on the overall sequence length. 

An ideal positional encoding should provide a unique representation for each 
position, it should be bounded, it should generalize to longer sequences, and it should 
have a consistent way to express the number of steps between any two input vectors 
irrespective of their absolute position because the relative position of tokens is often 
more important than the absolute position. 

There are many approaches to positional encoding (Dufter, Schmitt, and Schütze,
2021). Here we describe a technique based on sinusoidal functions introduced by
Vaswani et al. (2017). For a given position $n$ the associated position-encoding
vector has components $r_{ni}$ given by

$$r_{ni} = \begin{cases} \sin\left(\dfrac{n}{L^{i/D}}\right), & \text{if } i \text{ is even}, \\[2ex] \cos\left(\dfrac{n}{L^{(i-1)/D}}\right), & \text{if } i \text{ is odd}. \end{cases} \tag{12.25}$$

We see that the elements of the embedding vector $\mathbf{r}_n$ are given by a series of sine and
cosine functions of steadily increasing wavelength, as illustrated in Figure 12.10(a).

Figure 12.10 Illustrations of the functions defined by (12.25) and used to construct
position-encoding vectors. (a) A plot in which the horizontal axis shows the different
components of the embedding vector $\mathbf{r}$ whereas the vertical axis shows the position in
the sequence. The values of the vector elements for two positions $n$ and $m$ are shown
by the intersections of the sine and cosine curves with the horizontal grey lines. (b) A
heat map illustration of the position-encoding vectors defined by (12.25) for
dimension $D = 100$ with $L = 30$ for the first $N = 200$ positions.

This encoding has the property that the elements of the vector $\mathbf{r}_n$ all lie in the
range $(-1, 1)$. It is reminiscent of the way binary numbers are represented, with the
lowest order bit alternating with high frequency, and subsequent bits alternating with
steadily decreasing frequencies:

    1:  0 0 0 1
    2:  0 0 1 0
    3:  0 0 1 1
    4:  0 1 0 0
    5:  0 1 0 1
    6:  0 1 1 0
    7:  0 1 1 1
    8:  1 0 0 0
    9:  1 0 0 1


For the encoding given by (12.25), however, the vector elements are continuous 
variables rather than binary. A plot of the position-encoding vectors is shown in 
Figure 12.10(b). 

One nice property of the sinusoidal representation given by (12.25) is that, for 
any fixed offset k, the encoding at position n + k can be represented as a linear 
combination of the encoding at position n, in which the coefficients do not depend 
on the absolute position but only on the value of k. The network should therefore be 
able to learn to attend to relative positions. Note that this property requires that the 
encoding makes use of both sine and cosine functions. 
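The sinusoidal encoding and its relative-position property can be sketched as follows, assuming the form of (12.25) in which component $i$ uses $\sin(n / L^{i/D})$ for even $i$ and $\cos(n / L^{(i-1)/D})$ for odd $i$. Positions are numbered from zero and $D$ is taken to be even; both are assumptions of this sketch, as are the function names. The matrix built by `shift_matrix` rotates each (sine, cosine) pair by an angle that depends only on the offset $k$.

```python
import numpy as np

def positional_encoding(N, D, L):
    """Matrix whose row n is the position-encoding vector r_n for n = 0, ..., N-1."""
    n = np.arange(N)[:, None]
    i = np.arange(D)[None, :]
    even_i = i - (i % 2)                  # i for even components, i - 1 for odd ones
    angle = n / L ** (even_i / D)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def shift_matrix(k, D, L):
    """Matrix M, depending only on k, such that r_{n+k} = M r_n (D even)."""
    M = np.zeros((D, D))
    for j in range(0, D - 1, 2):
        w = 1.0 / L ** (j / D)            # frequency shared by components j and j + 1
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(w(n+k)) =  c sin(wn) + s cos(wn)
        # cos(w(n+k)) = -s sin(wn) + c cos(wn)
        M[j:j + 2, j:j + 2] = [[c, s], [-s, c]]
    return M

N, D, L = 200, 100, 30.0                  # values quoted for Figure 12.10(b)
R = positional_encoding(N, D, L)

k = 7
M = shift_matrix(k, D, L)
shifted = R[:-k] @ M.T                    # apply M to every r_n with n + k < N
matches = np.allclose(shifted, R[k:])
```

The check `matches` is `True` for every starting position $n$. The construction relies on each frequency appearing as both a sine and a cosine, which is why both functions are needed.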

Another popular approach to positional representation is to use learned position 
encodings. This is done by having a vector of weights at each token position that 
can be learned jointly with the rest of the model parameters during training, and 
avoids using hand-crafted representations. Because the parameters are not shared 
between the token positions, the tokens are no longer invariant under a permutation, 
which is the purpose of a positional encoding. However, this approach does not
meet the criterion we mentioned earlier of generalizing to longer input sequences,
as the encoding vectors for positions not seen during training will remain untrained.
Therefore, this approach is generally most suitable when the input length is relatively
constant during both training and inference.


12.2 Natural Language


Now that we have studied the architecture of the transformer, we will explore how 
this can be used to process language data consisting of words, sentences, and para- 
graphs. Although this is the modality that transformers were originally developed to 
operate on, they have proved to be a very general class of models and have become 
the state-of-the-art for most input data types. Later in this chapter we will look at 
their use in other domains. 

Many languages, including English, comprise a series of words separated by
white space, along with punctuation symbols, and therefore represent an example of
sequential data. For the moment we will focus on the words, and we will return to
punctuation later.

The first challenge is to convert the words into a numerical representation that 
is suitable for use as the input to a deep neural network. One simple approach is to 
define a fixed dictionary of words and then introduce vectors of length equal to the 
size of the dictionary along with a ‘one hot’ representation for each word, in which 
the kth word in the dictionary is encoded with a vector having a 1 in position k and 
0 in all other positions. For example if ‘aardwolf’ is the third word in our dictionary 
then its vector representation would be (0,0, 1,0,...,0). 

An obvious problem with a one-hot representation is that a realistic dictionary 
might have several hundred thousand entries leading to vectors of very high dimen- 
sionality. Also, it does not capture any similarities or relationships that might exist 
between words. Both issues can be addressed by mapping the words into a lower- 
dimensional space through a process called word embedding in which each word is 
represented as a dense vector in a space of typically a few hundred dimensions. 


12.2.1 Word embedding 


The embedding process can be defined by a matrix $\mathbf{E}$ of size $D \times K$ where
$D$ is the dimensionality of the embedding space and $K$ is the dimensionality of the
dictionary. For each one-hot encoded input vector $\mathbf{x}_n$ we can then calculate the
corresponding embedding vector using

$$\mathbf{v}_n = \mathbf{E}\mathbf{x}_n. \tag{12.26}$$

Because $\mathbf{x}_n$ has a one-hot encoding, the vector $\mathbf{v}_n$ is simply given by the
corresponding column of the matrix $\mathbf{E}$.
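The equivalence between multiplying by a one-hot vector and selecting a column can be seen in a short NumPy sketch. The dictionary size, the embedding dimension, and the random matrix are assumptions for illustration; note that the code uses zero-based indexing, so the third word has index 2.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 6, 3                        # dictionary size and embedding dimension (assumed)
E = rng.normal(size=(D, K))        # embedding matrix with one column per word

k = 2                              # zero-based index of the third word in the dictionary
x = np.zeros(K)
x[k] = 1.0                         # one-hot vector (0, 0, 1, 0, 0, 0)

v_product = E @ x                  # embedding via the matrix product (12.26)
v_lookup = E[:, k]                 # embedding via a direct column lookup
```

The two vectors are identical, so in practice the product is usually replaced by the cheaper lookup.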

We can learn the matrix E from a corpus (i.e., a large data set) of text, and 
there are many approaches to doing this. Here we look at a popular technique called 
word2vec (Mikolov et al., 2013), which can be viewed as a simple two-layer neural 
network. A training set is constructed in which each sample is obtained by consid- 
ering a ‘window’ of M adjacent words in the text, where a typical value might be 
M = 5. The samples are considered to be independent, and the error function is de- 
fined as the sum of the error functions for each sample. There are two variants of this 
approach. In continuous bag of words, the target variable for network training is the 
middle word, and the remaining context words form the inputs, so that the network 
is being trained to ‘fill in the blank’. A closely related approach, called skip-grams, 
reverses the inputs and outputs, so that the centre word is presented as the input and 
the target values are the context words. These models are illustrated in Figure 12.11. 

This training procedure can be viewed as a form of self-supervised learning since 
the data consists simply of a large corpus of unlabelled text from which many small 
windows of word sequences are drawn at random. Labels are obtained from the text 
itself by ‘masking’ out those words whose values the network is trying to predict. 
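The construction of training samples for the two variants can be sketched in plain Python. This sketch enumerates every window of $M$ adjacent words rather than drawing windows at random, and the function names and the example sentence are hypothetical.

```python
def windows(words, M=5):
    """Return (context, centre) for every window of M adjacent words (M odd)."""
    half = M // 2
    samples = []
    for start in range(len(words) - M + 1):
        window = words[start:start + M]
        centre = window[half]
        context = window[:half] + window[half + 1:]
        samples.append((context, centre))
    return samples

def cbow_pairs(words, M=5):
    """Continuous bag of words: the context words are inputs, the centre word is the target."""
    return windows(words, M)

def skipgram_pairs(words, M=5):
    """Skip-grams: the centre word is the input, each context word is a target."""
    return [(centre, c) for context, centre in windows(words, M) for c in context]

text = "the quick brown fox jumps over".split()
cbow = cbow_pairs(text, M=5)
skip = skipgram_pairs(text, M=5)
```

For this six-word text there are two windows, so `cbow` holds two samples, the first being `(['the', 'quick', 'fox', 'jumps'], 'brown')`, and `skip` holds eight (input, target) pairs.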

Once the model is trained, the embedding matrix E is given by the transpose 
of the second-layer weight matrix for the continuous bag-of-words approach and 
by the first-layer weight matrix for skip-grams. Words that are semantically related 
are mapped to nearby positions in the embedding space. This is to be expected
since related words are more likely to occur with similar context words compared
to unrelated words. For example, the words ‘city’ and ‘capital’ might occur with
higher frequency as context for target words such as ‘Paris’ or ‘London’ and less
frequently as context for ‘orange’ or ‘polynomial’. The network can more easily
predict the probability of the missing words if ‘Paris’ and ‘London’ are mapped to
nearby embedding vectors.

Figure 12.11 Two-layer neural networks used to learn word embeddings, where (a)
shows the continuous bag-of-words approach, and (b) shows the skip-grams approach.

It turns out that the learned embedding space often has an even richer semantic 
structure than just the proximity of related words, and that this allows for simple 
vector arithmetic. For example, the concept that ‘Paris is to France as Rome is to 
Italy’ can be expressed through operations on the embedding vectors. If we use 
v(word) to denote the embedding vector for ‘word’, then we find 


$$\mathbf{v}(\text{Paris}) - \mathbf{v}(\text{France}) + \mathbf{v}(\text{Italy}) \simeq \mathbf{v}(\text{Rome}). \tag{12.27}$$


Word embeddings were originally developed as natural language processing 
tools in their own right. Today, they are more likely to be used as pre-processing 
steps for deep neural networks. In this regard they can be viewed as the first layer 
in a deep neural network. They can be fixed using some standard pre-trained em- 
bedding matrix, or they can be treated as an adaptive layer that is learned as part of 
the overall end-to-end training of the system. In the latter case the embedding layer 
can be initialized either using random weight values or using a standard embedding 
matrix. 


Figure 12.12 An illustration of the process of tokenizing natural language by analogy
with byte pair encoding, using the text ‘Peter Piper picked a peck of pickled peppers’.
In this example, the most frequently occurring pair of characters is ‘pe’, which
occurs four times, and so these form a new token that replaces all the occurrences of
‘pe’. Note that ‘Pe’ is not included in this since upper-case ‘P’ and lower-case ‘p’ are
distinct characters. Next the pair ‘ck’ is added since this occurs three times. This is
followed by tokens such as ‘pi’, ‘ed’, and ‘per’, all of which occur twice, and so on.


Section 12.4.1 


12.2.2 Tokenization 


One problem with using a fixed dictionary of words is that it cannot cope with 
words not in the dictionary or which are misspelled. It also does not take account 
of punctuation symbols or other character sequences such as computer code. An 
alternative approach that addresses these problems would be to work at the level of 
characters instead of using words, so that our dictionary comprises upper-case and 
lower-case letters, numbers, punctuation, and white-space symbols such as spaces 
and tabs. A disadvantage of this approach, however, is that it discards the semanti- 
cally important word structure of language, and the subsequent neural network would 
have to learn to reassemble words from elementary characters. It would also require 
a much larger number of sequential steps for a given body of text, thereby increasing 
the computational cost of processing the sequence. 

We can combine the benefits of character-level and word-level representations 
by using a pre-processing step that converts a string of words and punctuation sym- 
bols into a string of tokens, which are generally small groups of characters and might 
include common words in their entirety, along with fragments of longer words as 
well as individual characters that can be assembled into less common words (Schus- 
ter and Nakajima, 2012). This tokenization also allows the system to process other 
kinds of sequences such as computer code or even other modalities such as images. 
It also means that variations of the same word can have related representations. For 
example, ‘cook’, ‘cooks’, ‘cooked’, ‘cooking’, and ‘cooker’ are all related and share 
the common element ‘cook’, which itself could be represented as one of the tokens. 

There are many approaches to tokenization. As an example, a technique called
byte pair encoding, which is used for data compression, can be adapted to text
tokenization by merging characters instead of bytes (Sennrich, Haddow, and Birch, 2015).
The process starts with the individual characters and iteratively merges them into 
longer strings. The list of tokens is first initialized with the list of individual char- 
acters. Then a body of text is searched for the most frequently occurring adjacent 
pairs of tokens and these are replaced with a new token. To ensure that words are not 
merged, a new token is not formed from two tokens if the second token starts with a 
white space. The process is repeated iteratively as illustrated in Figure 12.12. 
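The merging procedure can be sketched in plain Python using the tongue-twister from Figure 12.12. This is a simplified illustration, not the algorithm of any particular tokenizer library: words are split on white space beforehand, so merges never cross a word boundary (a stand-in for the white-space rule described above), and ties between equally frequent pairs are broken by order of first occurrence. All function names are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent token pairs within each word."""
    pairs = Counter()
    for tokens in words:
        for left, right in zip(tokens, tokens[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(tokens, pair):
    """Replace each occurrence of `pair` in a token list by a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_merges(text, num_merges):
    """Return the new tokens with their pair counts, and the tokenized words."""
    words = [list(word) for word in text.split()]   # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]       # most frequent adjacent pair
        merges.append((pair[0] + pair[1], count))
        words = [merge_pair(tokens, pair) for tokens in words]
    return merges, words

text = "Peter Piper picked a peck of pickled peppers"
merges, words = learn_merges(text, num_merges=2)
```

The first two merges are `('pe', 4)` and `('ck', 3)`, in agreement with Figure 12.12, after which ‘peck’ is represented by the two tokens `['pe', 'ck']`.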

Initially the number of tokens is equal to the number of characters, which is
relatively small. As tokens are formed, the total number of tokens increases, and
if this is continued long enough, the tokens will eventually correspond to the set of
words in the text. The total number of tokens is generally fixed in advance, as a
compromise between character-level and word-level representations. The algorithm
is stopped when this number of tokens is reached.

In practical applications of deep learning to natural language, the input text is 
typically first mapped into a tokenized representation. However, for the remainder of 
this chapter, we will use word-level representations as this makes it easier to illustrate 
and motivate key concepts. 


12.2.3 Bag of words 


We now turn to the task of modelling the joint distribution $p(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ of an
ordered sequence of vectors, such as words (or tokens) in a natural language. The
simplest approach is to assume that the words are drawn independently from the
same distribution and hence that the joint distribution is fully factorized in the form

$$p(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n). \tag{12.28}$$


This can be expressed as a probabilistic graphical model in which the nodes are 
isolated with no interconnecting links. 

The distribution p(x) is shared across the variables and can be represented, with- 
out loss of generality, as a simple table listing the probabilities of each of the possi- 
ble states of x (corresponding to the dictionary of words or tokens). The maximum 
likelihood solution for this model is obtained simply by setting each of these proba- 
bilities to the fraction of times that the word occurs in the training set. This is known 
as a bag-of-words model because it completely ignores the ordering of the words. 
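The maximum likelihood solution amounts to counting. The following sketch in plain Python estimates the table of word probabilities from a tiny made-up corpus and evaluates (12.28) for two orderings of the same words; the corpus and function names are hypothetical.

```python
from collections import Counter

def unigram_mle(words):
    """Maximum likelihood estimate: the fraction of times each word occurs."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

def sequence_probability(words, probs):
    """Joint probability under the fully factorized model (12.28)."""
    p = 1.0
    for word in words:
        p *= probs.get(word, 0.0)    # unseen words receive zero probability
    return p

corpus = "the food was good the service was bad".split()
probs = unigram_mle(corpus)

p_forward = sequence_probability("the food was good".split(), probs)
p_shuffled = sequence_probability("good was food the".split(), probs)
```

Both orderings receive the same probability, which is the sense in which the model ignores word order.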

We can use the bag-of-words approach to construct a simple text classifier. This
could be used for example in sentiment analysis in which a passage of text
representing a restaurant review is to be classified as positive or negative. The naive Bayes
classifier assumes that the words are independent within each class $\mathcal{C}_k$, but with a
different distribution for each class, so that

$$p(\mathbf{x}_1, \ldots, \mathbf{x}_N \mid \mathcal{C}_k) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathcal{C}_k). \tag{12.29}$$

Given prior class probabilities $p(\mathcal{C}_k)$, the posterior class probabilities for a new
sequence are given by:

$$p(\mathcal{C}_k \mid \mathbf{x}_1, \ldots, \mathbf{x}_N) \propto p(\mathcal{C}_k) \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathcal{C}_k). \tag{12.30}$$

Both the class-conditional densities $p(\mathbf{x} \mid \mathcal{C}_k)$ and the prior probabilities $p(\mathcal{C}_k)$ can
be estimated using frequencies from the training data set. For a new sequence, the
table entries are multiplied together to get the desired posterior probabilities. Note
that if a word occurs in the test set that was not present in the training set then the
corresponding probability estimate will be zero, and so these estimates are typically
‘smoothed’ after training by reassigning a small level of probability uniformly across
all entries to avoid zero values.
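A naive Bayes classifier based on (12.29) and (12.30) can be sketched in plain Python. Several choices here are assumptions rather than details from the text: smoothing is done by adding a constant `alpha` to every count (add-one smoothing when `alpha = 1`), probabilities are handled as logarithms to avoid underflow, words outside the training vocabulary are skipped at prediction time, and the four training reviews are invented for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(docs, alpha=1.0):
    """docs is a list of (words, label) pairs; returns log-priors, log-likelihoods, vocabulary."""
    vocab = {word for words, _ in docs for word in words}
    label_counts = Counter(label for _, label in docs)
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_lik = {}
    for c in label_counts:
        counts = Counter(w for words, label in docs if label == c for w in words)
        total = sum(counts.values()) + alpha * len(vocab)
        # Smoothed frequency estimate of p(word | class).
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def classify(words, log_prior, log_lik, vocab):
    """Return the class with the largest (unnormalized) log-posterior, as in (12.30)."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in words if w in vocab)
    return max(scores, key=scores.get)

docs = [
    ("good food great service".split(), "positive"),
    ("great meal good value".split(), "positive"),
    ("bad food terrible service".split(), "negative"),
    ("terrible meal bad value".split(), "negative"),
]
log_prior, log_lik, vocab = train_naive_bayes(docs)

label_a = classify("good great food".split(), log_prior, log_lik, vocab)
label_b = classify("bad terrible meal".split(), log_prior, log_lik, vocab)
```

With this toy data `label_a` is `'positive'` and `label_b` is `'negative'`. Thanks to the smoothing, a word such as ‘good’ that never appears in a negative review still has non-zero probability under the negative class.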


12.2.4 Autoregressive models 


One obvious major limitation of the bag-of-words model is that it completely
ignores word order. To address this we can take an autoregressive approach. Without
loss of generality we can decompose the distribution over the sequence of words into
a product of conditional distributions in the form

$$p(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1}). \tag{12.31}$$


This can be represented as a probabilistic graphical model in which each node in the 
sequence receives a link from every previous node. We could represent each term 
on the right-hand side of (12.31) by a table whose entries are once again estimated 
using simple frequency counts from the training set. However, the size of these 
tables grows exponentially with the length of the sequence, and so this approach 
would become prohibitively expensive. 

We can simplify the model dramatically by assuming that each of the condi- 
tional distributions on the right-hand side of (12.31) is independent of all previous 
observations except the L most recent words. For example, if L = 2 then the joint 
distribution for a sequence of N observations under this model is given by 


$$p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2}). \qquad (12.32)$$


In the corresponding graphical model each node has links from the two previous 
nodes. Here we assume that the conditional distributions $p(x_n \mid x_{n-1}, x_{n-2})$ are shared
across all variables. Again each of the distributions on the right-hand side of (12.32) 
can be represented as tables whose values are estimated from the statistics of triplets 
of successive words drawn from a training corpus. 

The case with L = 1 is known as a bi-gram model because it depends on pairs 
of adjacent words. Similarly L = 2, which involves triplets of adjacent words, is 
called a tri-gram model, and in general these are called n-gram models. 
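
As a minimal sketch of how such tables might be estimated, the following code counts how often each word follows each context of the L previous words and normalizes the counts into conditional probabilities. The toy corpus and the function names `fit_ngram` and `conditional` are illustrative assumptions.

```python
from collections import Counter, defaultdict

def fit_ngram(tokens, L):
    """Count how often each word follows each context of the L previous words."""
    table = defaultdict(Counter)
    for n in range(L, len(tokens)):
        context = tuple(tokens[n - L:n])
        table[context][tokens[n]] += 1
    return table

def conditional(table, context):
    """Normalize the counts for one context into p(x_n | previous L words)."""
    counts = table[tuple(context)]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

corpus = "the cat sat on the mat and the cat ran".split()
bigram = fit_ngram(corpus, L=1)     # pairs of adjacent words
trigram = fit_ngram(corpus, L=2)    # triplets of adjacent words
p_after_the = conditional(bigram, ["the"])
p_after_the_cat = conditional(trigram, ["the", "cat"])
```

With a dictionary of K words, a table indexed by L context words has up to K^L rows, which is the exponential growth that limits this approach.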

All the models discussed so far in this section can be run generatively to synthe- 
size novel text. For example, if we provide the first and second words in a sequence, 
then we can sample from the tri-gram statistics $p(x_n \mid x_{n-1}, x_{n-2})$ to generate the
third word, and then we can use the second and third words to sample the fourth 
word, and so on. The resulting text, however, will be incoherent because each word 
is predicted only on the basis of the two previous words. High-quality text models 
must take account of the long-range dependencies in language. On the other hand, 
we cannot simply increase the value of L because the size of the probability tables 
grows exponentially in L so that it is prohibitively expensive to go much beyond 
tri-gram models. However, the autoregressive representation will play a central role
when we consider modern language models based not on probability tables but on
deep neural networks configured as transformers.

Figure 12.13 A general RNN with parameters w. It takes a sequence $x_1, \ldots, x_N$ as input
and generates a sequence $y_1, \ldots, y_N$ as output. Each of the boxes corresponds to a
multilayer network with nonlinear hidden units.
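
The generative use of a tri-gram model described above can be sketched as follows. Given the first two words, each further word is sampled from the counts associated with the two words that precede it. The toy corpus, the function names, and the choice to stop when an unseen context is reached are assumptions of this sketch.

```python
import random
from collections import Counter, defaultdict

def fit_trigram(tokens):
    """Count the words that follow each pair of adjacent words."""
    table = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        table[(a, b)][c] += 1
    return table

def generate(table, first, second, max_words, rng):
    """Extend the sequence one word at a time using only the last two words."""
    words = [first, second]
    while len(words) < max_words:
        counts = table.get((words[-2], words[-1]))
        if not counts:   # context never seen in training: stop early
            break
        choices, weights = zip(*counts.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return words

corpus = "the cat sat on the mat and the dog sat on the rug".split()
table = fit_trigram(corpus)
sample = generate(table, "the", "cat", max_words=8, rng=random.Random(0))
```

Every triplet in the generated sequence occurs somewhere in the training corpus, but nothing constrains words that are more than two positions apart, which is why longer samples quickly lose coherence.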

One way to allow longer-range dependencies, while avoiding the exponential 
growth in the number of parameters of an n-gram model, is to use a hidden Markov 
model whose graphical structure is shown in Figure 11.31. The number of learn- 
able parameters is governed by the dimensionality of the latent variables whereas 
the distribution over a given observation $x_n$ depends, in principle, on all previous
observations. However, the influence of more distant observations is still very lim- 
ited since their effect must be carried through the chain of latent states which are 
themselves being updated by more recent observations. 


12.2.5 Recurrent neural networks 


Techniques such as n-grams have very poor scaling with sequence length be- 
cause they store completely general tables of conditional distributions. We can 
achieve much better scaling by using parameterized models based on neural net- 
works. Suppose we simply apply a standard feed-forward neural network to se- 
quences of words in natural language. One problem that arises is that the network 
has a fixed number of inputs and outputs, whereas we need to be able to handle se- 
quences in the training and test sets that have variable length. Furthermore, if a word, 
or group of words, at a particular location in a sequence represents some concept then 
the same word, or group of words, at a different location is likely to represent the 
same concept at that new location. This is reminiscent of the equivariance property 
we encountered in processing image data. If we can construct a network architecture 
that is able to share parameters across the sequence then not only can we capture this 
equivariance property but we can greatly reduce the number of free parameters in 
the model as well as handle sequences having different lengths. 

To address this we can borrow inspiration from the hidden Markov model and
introduce an explicit hidden variable $z_n$ associated with each step n in the sequence.
The neural network takes as input both the current word $x_n$ and the current hidden
state $z_{n-1}$ and produces an output word $y_n$ as well as the next state $z_n$ of the hidden
variable. We can then chain together copies of this network, in which the weight
values are shared across the copies. The resulting architecture is called a recurrent
neural network (RNN) and is illustrated in Figure 12.13. Here the initial value of
the hidden state, $z_0$, may be initialized, for example, to some default value.

Figure 12.14 An example of a recurrent neural network used for language translation. The diagram labels
the two parts of the network ‘encoder’ and ‘decoder’. See the text for details.

As an example of how an RNN might be used in practice, consider the specific
task of translating sentences from English into Dutch. The sentences can have
variable length, and each output sentence might have a different length from the cor- 
responding input sentence. Furthermore, the network may need to see the whole of 
the input sentence before it can even start to generate the output sentence. We can 
address this using an RNN by feeding in the complete English sentence followed by 
a special input token, which we denote by ⟨start⟩, to trigger the start of translation.
During training the network learns to associate ⟨start⟩ with the beginning of the
output sentence. We also take each successively generated word and feed it into the
input at the next time step, as shown in Figure 12.14. The network can be trained
to generate a specific ⟨stop⟩ token to signify the completion of the translation. The
first few stages of the network are used to absorb the input sequence, and the associ- 
ated output vectors are simply ignored. This part of the network can be viewed as an 
‘encoder’ in which the entire input sentence has been compressed into the state z* of 
the hidden variable. The remaining network stages function as the ‘decoder’, which 
generates the translated sentence as output one word at a time. Notice that each out- 
put word is fed as input to the next stage of the network, and so this approach has an 
autoregressive structure analogous to (12.31). 
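
The recurrence that underlies such a network can be sketched as a single cell applied at every step with shared weights. The code below is a minimal illustration, not a translation system: the single-layer cell with a tanh hidden unit, the parameter names, and the random weights are all assumptions. It shows that the same parameters handle sequences of different lengths and that the final hidden state summarizes the whole input.

```python
import numpy as np

def rnn_forward(xs, params, z0=None):
    """Apply one shared cell at every step: (x_n, z_{n-1}) -> (y_n, z_n)."""
    W_xz, W_zz, b_z, W_zy, b_y = params
    z = np.zeros(W_zz.shape[0]) if z0 is None else z0
    ys = []
    for x in xs:                       # the same weights are reused at every step
        z = np.tanh(W_xz @ x + W_zz @ z + b_z)
        logits = W_zy @ z + b_y
        e = np.exp(logits - logits.max())
        ys.append(e / e.sum())         # softmax over the K dictionary entries
    return np.array(ys), z

rng = np.random.default_rng(0)
D, H, K = 4, 8, 5                      # input, hidden, and dictionary sizes
params = (rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H),
          rng.normal(size=(K, H)), np.zeros(K))
ys_short, z_short = rnn_forward(rng.normal(size=(3, D)), params)
ys_long, z_long = rnn_forward(rng.normal(size=(7, D)), params)
```

In an encoder-decoder arrangement, the outputs produced while reading the input would be discarded and only the final state passed on to the stages that generate the translation.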


12.2.6 Backpropagation through time 


RNNs can be trained by stochastic gradient descent using gradients calculated 
by backpropagation and evaluated through automatic differentiation, just as with reg- 
ular neural networks. The error function consists of a sum over all output units of 
the error for each unit, in which each output unit has a softmax activation function 
along with an associated cross-entropy error function. During forward propagation, 
the activation values are propagated all the way from the first input in the sequence 
through to all the output nodes in the sequence, and error signals are then backpropagated
along the same paths. This process is called backpropagation through time
and in principle is straightforward. However, in practice, for very long sequences,
training can be difficult due to the problems of vanishing gradients or exploding 
gradients that arise with very deep network architectures. 
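
The origin of vanishing and exploding gradients can be illustrated with a deliberately simplified recurrence. In the sketch below we assume a purely linear hidden-state update $z_n = W z_{n-1}$ (the nonlinearity is omitted) and choose W to be a scaled orthogonal matrix, so that each backward step multiplies the norm of the error signal by exactly the factor `scale`. These choices, and the function name, are assumptions made for illustration.

```python
import numpy as np

def backprop_norm(scale, steps, size=6, seed=0):
    """Norm of an error signal after passing back through `steps` linear recurrences."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(size, size)))   # orthogonal matrix
    W = scale * Q                     # all singular values of W equal `scale`
    delta = np.ones(size) / np.sqrt(size)   # unit-norm error signal at the last step
    for _ in range(steps):
        delta = W.T @ delta           # one step of backpropagation through time
    return np.linalg.norm(delta)

vanishing = backprop_norm(scale=0.9, steps=100)   # shrinks like 0.9 ** 100
exploding = backprop_norm(scale=1.1, steps=100)   # grows like 1.1 ** 100
```

After 100 steps the two signals differ by many orders of magnitude, even though the weights differ only slightly in scale.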

Another problem with standard RNNs is that they deal poorly with long-range 
dependencies. This is especially problematic for natural language where such depen- 
dencies are widespread. In a long passage of text, a concept might be introduced that 
plays an important role in predicting words occurring much later in the text. In the 
architecture shown in Figure 12.14, the entire concept of the English sentence must 
be captured in the single hidden vector z* of fixed length, and this becomes increas- 
ingly problematic with longer sequences. This is known as the bottleneck problem 
because a sequence of arbitrary length has to be summarized in a single hidden vec- 
tor of activations and the network can start to generate the output translation only 
once the full input sequence has been processed. 

One approach for addressing both the vanishing and exploding gradients prob- 
lems and the limited long-range dependencies is to modify the architecture of the 
neural network to allow additional signal paths that bypass many of the processing 
steps within each stage of the network and hence allow information to be remem- 
bered over a larger number of time steps. Long short-term memory (LSTM) models 
(Hochreiter and Schmidhuber, 1997) and gated recurrent unit (GRU) models (Cho et 
al., 2014) are the most widely known examples. Although they improve performance 
compared to standard RNNs, they still have a limited ability to model long-range de- 
pendencies. Also, the additional complexity of each cell means that LSTMs are even 
slower to train than standard RNNs. Furthermore, all recurrent networks have signal 
paths that grow linearly with the number of steps in the sequence. Moreover, they do 
not support parallel computation within a single training example due to the sequen- 
tial nature of the processing. In particular, this means that RNNs struggle to make 
efficient use of modern highly parallel hardware based on GPUs. These problems 
are addressed by replacing RNNs with transformers. 


12.3 Transformer Language Models


The transformer processing layer is a highly flexible component for building pow- 
erful neural network models with broad applicability. In this section we explore the 
application of transformers to natural language. This has given rise to the develop- 
ment of massive neural networks known as large language models (LLMs), which 
have proven to be exceptionally capable (Zhao et al., 2023). 

Transformers can be applied to many different kinds of language processing 
task, and can be grouped into three categories according to the form of the input 
and output data. In a problem such as sentiment analysis, we take a sequence of 
words as input and provide a single variable representing the sentiment of the text, 
for example happy or sad, as output. Here a transformer is acting as an ‘encoder’ 
of the sequence. Other problems might take a single vector as input and generate 
a word sequence as output, for example if we wish to generate a text caption given 
an input image. In such cases the transformer functions as a ‘decoder’, generating
a sequence as output. Finally, in sequence-to-sequence processing tasks, both the
input and the output comprise a sequence of words, for example if our goal is to 
translate from one language to another. In this case, transformers are used in both 
encoder and decoder roles. We discuss each of these classes of language model in 
turn, using illustrative examples of model architectures. 


12.3.1 Decoder transformers 


We start by considering decoder-only transformer models. These can be used as 
generative models that create output sequences of tokens. As an illustrative example, 
we will focus on a class of models called GPT which stands for generative pre- 
trained transformer (Radford et al., 2019; Brown et al., 2020; OpenAI, 2023). The 
goal is to use the transformer architecture to construct an autoregressive model of the 
form defined by (12.31) in which the conditional distributions $p(x_n \mid x_1, \ldots, x_{n-1})$
are expressed using a transformer neural network that is learned from data.

The model takes as input a sequence consisting of the first $n - 1$ tokens, and
its corresponding output represents the conditional distribution for token n. If we 
draw a sample from this distribution then we have extended the sequence to n tokens 
and this new sequence can be fed back through the model to give a distribution over 
token n + 1, and so on. The process can be repeated to generate sequences up to 
a maximum length determined by the number of inputs to the transformer. We will 
shortly discuss strategies for sampling from the conditional distributions, but for the 
moment we focus on how to construct and train the network. 

The architecture of a GPT model consists of a stack of transformer layers that
take a sequence $x_1, \ldots, x_N$ of tokens, each of dimensionality D, as input and produce
a sequence $\tilde{x}_1, \ldots, \tilde{x}_N$ of tokens, again of dimensionality D, as output. Each
output needs to represent a probability distribution over the dictionary of tokens at
that time step, and this dictionary has dimensionality K whereas the tokens have a
dimensionality of D. We therefore make a linear transformation of each output token
using a matrix $W^{(p)}$ of dimensionality D × K followed by a softmax activation
function in the form

$$Y = \mathrm{Softmax}\left(\tilde{X} W^{(p)}\right) \qquad (12.33)$$

where Y is a matrix whose nth row is $y_n^{\mathrm{T}}$, and $\tilde{X}$ is a matrix whose nth row is
$\tilde{x}_n^{\mathrm{T}}$. Each softmax output unit has an associated cross-entropy error function. The
architecture of the model is shown in Figure 12.15.
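
The output transformation (12.33) and its cross-entropy error can be written directly in array form. In this sketch the matrix `X_tilde` is filled with random numbers as a stand-in for the outputs of the final transformer layer, and the target indices are hypothetical.

```python
import numpy as np

def softmax_rows(A):
    """Apply the softmax function independently to each row of A."""
    A = A - A.max(axis=1, keepdims=True)   # subtract row maxima for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 5, 8, 11                  # sequence length, token dimension, dictionary size
X_tilde = rng.normal(size=(N, D))   # stand-in for the transformer outputs
W_p = rng.normal(size=(D, K))       # output matrix, shared across token positions
Y = softmax_rows(X_tilde @ W_p)     # (12.33): row n is a distribution over K tokens

targets = rng.integers(0, K, size=N)                 # hypothetical next-token indices
cross_entropy = -np.log(Y[np.arange(N), targets]).sum()
```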

Figure 12.15 Architecture of a GPT decoder transformer network. Here ‘LSM’ stands for linear-softmax and
denotes a linear transformation whose learnable parameters are shared across the token positions, followed by
a softmax activation function. Masking is explained in the text.

The model can be trained using a large corpus of unlabelled natural language
by taking a self-supervised approach. Each training sample consists of a sequence
of tokens $x_1, \ldots, x_n$, which form the input to the network, along with an associated
target value $x_{n+1}$ consisting of the next token in the sequence. The sequences are
considered to be independent and identically distributed so that the error function
used for training is the sum of the cross-entropy error values over the training
set, grouped into appropriate mini-batches. Naively we could process each such
training sample independently using a forward pass through the model. However,
we can achieve much greater efficiency by processing an entire sequence at once so
that each token acts both as a target value for the sequence of previous tokens and as
an input value for subsequent tokens. For example, consider the word sequence

I swam across the river to get to the other bank.


We can use ‘I swam across’ as an input sequence with an associated target of ‘the’, 
and also use ‘I swam across the’ as an input sequence with an associated target of 
‘river’, and so on. However, to process these in parallel we have to ensure that the 
network is not able to ‘cheat’ by looking ahead in the sequence, otherwise it will 
simply learn to copy the next input directly to the output. If it did this, it would 
then be unable to generate new sequences since the subsequent token by definition is 
not available at test time. To address this problem we do two things. First, we shift 
the input sequence to the right by one step, so that input $x_n$ corresponds to output
$y_{n+1}$, with target $x_{n+1}$, and an additional special token denoted ⟨start⟩ is prepended
in the first position of the input sequence. Second, note that the tokens in a
transformer are processed independently, except when they are used to compute the 
attention weights, when they interact in pairs through the dot product. We therefore 
introduce masked attention, sometimes called causal attention, into each of the attention
layers, in which we set to zero all of the attention weights that correspond to
a token attending to any later token in the sequence. This simply involves setting to
zero all the corresponding elements of the attention matrix Attention(Q, K, V) defined
by (12.14) and then normalizing the remaining elements so that each row once
again sums to one. In practice, this can be achieved by setting the corresponding
pre-activation values to $-\infty$ so that the softmax evaluates to zero for the associated
outputs and also takes care of the normalization across the non-zero outputs. The
structure of the masked attention matrix is illustrated in Figure 12.16.

Figure 12.16 An illustration of the mask matrix for masked self-attention. Attention weights
corresponding to the red elements are set to zero. Thus, in predicting the token ‘across’,
the output can depend only on the input tokens ‘⟨start⟩’, ‘I’, and ‘swam’.
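
The masking step can be sketched in a few lines of array code. The function below is a minimal single-head illustration that assumes scaled dot-product attention and, for simplicity, uses the same random matrix for the queries, keys, and values. Pre-activations above the diagonal are set to minus infinity before the softmax, so the corresponding weights are exactly zero and each row still sums to one.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention in which position n sees only positions <= n."""
    N, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # pre-activations set to -inf
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) evaluates to 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # five tokens of dimensionality four
out, A = causal_attention(X, X, X)   # A is lower triangular with rows summing to one
```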

In practice, we wish to make efficient use of the massive parallelism of GPUs, 
and hence multiple sequences may be stacked together into an input tensor for par- 
allel processing in a single batch. However, this requires the sequences to be of the 
same length, whereas text sequences naturally have variable length. This can be addressed
by introducing a specific token, which we denote by ⟨pad⟩, that is used to
fill unused positions to bring all sequences up to the same length so that they can
be combined into a single tensor. An additional mask is then used in the attention
weights to ensure that the output vectors do not pay attention to any inputs occupied
by the ⟨pad⟩ token. Note that the form of this mask depends on the particular input
sequence. 
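
A minimal sketch of padding and of the resulting mask is given below. The integer identifier `PAD = 0`, the function name, and the example token identifiers are assumptions. The padding mask differs from one sequence to the next, whereas the look-ahead mask is the same for every sequence, and an attention weight is blocked if either mask applies.

```python
import numpy as np

PAD = 0   # hypothetical integer id reserved for the pad token

def pad_batch(sequences):
    """Stack variable-length id sequences into one array, padding on the right."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), PAD, dtype=np.int64)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
    return batch

batch = pad_batch([[5, 3, 9], [7, 2], [4, 8, 6, 1]])
L = batch.shape[1]
is_pad = batch == PAD                                   # shape (batch, length)
causal = np.triu(np.ones((L, L), dtype=bool), k=1)      # look-ahead mask
# The weight at [b, n, m] is blocked if m lies in the future of n or m is padding.
blocked = causal[None, :, :] | is_pad[:, None, :]
```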

The output of the trained model is a probability distribution over the space of 
tokens, given by the softmax output activation function, which represents the prob- 
ability of the next token given the current token sequence. Once this next word is 
chosen, the token sequence with the new token included can then be fed through the 
model again to generate the subsequent token in the sequence, and this process can 
be repeated indefinitely or until an end-of-sequence token is generated. This may ap- 
pear to be quite inefficient since data must be fed through the whole model for each 
new generated token. However, note that due to the masked attention, the embedding 
learned for a particular token depends only on that token itself and on earlier tokens
and hence does not change when a new, later token is generated. Consequently, much
of the computation can be recycled when processing a new token.
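
This property can be checked numerically for a single masked attention layer. In the sketch below, which uses random weights and a hypothetical single-head layer, the outputs for the first five tokens are unchanged when a sixth token is appended, and the output for the new token can be computed from stored key and value vectors without reprocessing the earlier tokens. Storing keys and values in this way is one common form of reuse; the variable names are assumptions.

```python
import numpy as np

def causal_attention(X, W_q, W_k, W_v):
    """Single-head masked self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores = np.where(np.triu(np.ones(scores.shape, dtype=bool), k=1), -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
D = 4
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
X = rng.normal(size=(6, D))                       # six token embeddings

full = causal_attention(X, W_q, W_k, W_v)         # recompute everything
prefix = causal_attention(X[:5], W_q, W_k, W_v)   # outputs before the sixth token arrived

# Incremental step: reuse stored keys and values, compute only the new row.
K_cache, V_cache = X[:5] @ W_k, X[:5] @ W_v
k_new, v_new, q_new = X[5] @ W_k, X[5] @ W_v, X[5] @ W_q
K_all, V_all = np.vstack([K_cache, k_new]), np.vstack([V_cache, v_new])
s = K_all @ q_new / np.sqrt(D)
w = np.exp(s - s.max())
new_row = (w / w.sum()) @ V_all
```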


12.3.2 Sampling strategies 


We have seen that the output of a decoder transformer is a probability distribu- 
tion over values for the next token in the sequence, from which a particular value 
for that token must be chosen to extend the sequence. There are several options for 
selecting the value of the token based on the computed probabilities (Holtzman et 
al., 2019). One obvious approach, called greedy search, is simply to select the token 
with the highest probability. This has the effect of making the model deterministic, 
in that a given input sequence always generates the same output sequence. Note that 
simply choosing the highest probability token at each stage is not the same as select- 
ing the highest probability sequence of tokens. To find the most probable sequence, 
we would need to maximize the joint distribution over all tokens, which is given by 


$$p(y_1, \ldots, y_N) = \prod_{n=1}^{N} p(y_n \mid y_1, \ldots, y_{n-1}). \qquad (12.34)$$


If there are N steps in the sequence and the number of token values in the dictionary 
is K then the total number of sequences is $O(K^N)$, which grows exponentially with
the length of the sequence, and hence finding the single most probable sequence is 
infeasible. By comparison, greedy search has cost O(N), which is linear in the 
sequence length. 

One technique that has the potential to generate higher probability sequences 
than greedy search is called beam search. Instead of choosing the single most proba- 
ble token value at each step, we maintain a set of B hypotheses, where B is called the 
beam width, each consisting of a sequence of token values up to step n. We then feed 
all these sequences through the network, and for each sequence we find the B most 
probable token values, thereby creating $B^2$ possible hypotheses for the extended
sequence. This list is then pruned by selecting the most probable B hypotheses ac- 
cording to the total probability of the extended sequence. Thus, the beam search 
algorithm maintains B alternative sequences and keeps track of their probabilities, 
finally selecting the most probable sequence amongst those considered. Because the 
probability of a sequence is obtained by multiplying the probabilities at each step of 
the sequence and since these probabilities are always less than or equal to one, a long
sequence will generally have a lower probability than a short one, biasing the results
towards short sequences. For this reason the sequence probabilities are generally
normalized by the corresponding lengths of the sequence before making comparisons.
Beam search has cost $O(BKN)$, which is again linear in the sequence length.
However, the cost of generating a sequence is increased by a factor of B, and so for 
very large language models, where the cost of inference can become significant, this 
makes beam search much less attractive. 
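
The following sketch illustrates beam search on a toy problem. A fixed table `P` of first-order transition probabilities stands in for a trained model (an assumption made so that the result can be checked by hand), and all hypotheses have the same length, so length normalization is omitted. Setting B = 1 recovers greedy search.

```python
import numpy as np

def beam_search(log_prob_fn, start, steps, B):
    """Keep the B most probable partial sequences at every step."""
    beams = [([start], 0.0)]                      # (tokens, total log probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            log_p = log_prob_fn(tokens)
            for k in np.argsort(log_p)[-B:]:      # B best continuations of this beam
                candidates.append((tokens + [int(k)], score + float(log_p[k])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:B]                    # prune back down to B hypotheses
    return beams[0]

# Hypothetical stand-in for a trained model: row i gives p(next token | last token = i).
P = np.array([[0.10, 0.50, 0.40],
              [0.40, 0.30, 0.30],
              [0.05, 0.05, 0.90]])
log_prob_fn = lambda tokens: np.log(P[tokens[-1]])

greedy_seq, greedy_score = beam_search(log_prob_fn, start=0, steps=2, B=1)
beam_seq, beam_score = beam_search(log_prob_fn, start=0, steps=2, B=2)
```

Here greedy search commits to token 1 at the first step and ends with a sequence of probability 0.5 × 0.4 = 0.2, whereas a beam of width two retains the alternative first token and finds the sequence with probability 0.4 × 0.9 = 0.36.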

One problem with approaches such as greedy search and beam search is that they
limit the diversity of potential outputs and can even cause the generation process to
become stuck in a loop, where the same sub-sequence of words is repeated over and
over. As can be seen in Figure 12.17, human-generated text may have lower probability
and hence be more surprising with respect to a given model than automatically
generated text.

Figure 12.17 A comparison of the token probabilities from beam search and human text for a given trained
transformer language model and a given initial input sequence, showing how the human sequence has much
lower token probabilities. [From Holtzman et al. (2019) with permission.]

Instead of trying to find a sequence with the highest probability, we can instead 
generate successive tokens simply by sampling from the softmax distribution at each 
step. However, this can lead to sequences that are nonsensical. This arises from 
the typically very large size of the token dictionary, in which there is a long tail of 
many token states each of which has a very small probability but which in aggregate 
account for a significant fraction of the total probability mass. This leads to the 
problem in which there is a significant chance that the system will make a bad choice 
for the next token. 

As a balance between these extremes, we can consider only the states having the 
top K probabilities, for some choice of K, and then sample from these according to 
their renormalized probabilities. A variant of this approach, called top-p sampling 
or nucleus sampling, calculates the cumulative probability of the top outputs until a 
threshold is reached and then samples from this restricted set of token states. 
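
Both truncation schemes can be sketched as filters applied to the softmax output before sampling. The function names and the example distribution below are assumptions, and the top-p filter follows one common convention in which the smallest set of most probable states whose cumulative probability reaches the threshold is retained.

```python
import numpy as np

def top_k_filter(p, k):
    """Zero out all but the k most probable states, then renormalize."""
    keep = np.argsort(p)[-k:]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def top_p_filter(p, threshold):
    """Keep the smallest set of most probable states whose total mass reaches `threshold`."""
    order = np.argsort(p)[::-1]                      # most probable first
    cumulative = np.cumsum(p[order])
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    q = np.zeros_like(p)
    q[order[:n_keep]] = p[order[:n_keep]]
    return q / q.sum()

p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])    # hypothetical softmax output
p_top_k = top_k_filter(p, k=2)               # only the two most probable states survive
p_top_p = top_p_filter(p, threshold=0.8)     # three states are needed to reach 0.8
rng = np.random.default_rng(0)
token = rng.choice(len(p), p=p_top_p)        # sample from the restricted distribution
```

The number of states retained by the top-p filter adapts to the shape of the distribution, whereas the top-K filter always retains the same number.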

A ‘softer’ version of top-K sampling is to introduce a parameter T called temperature
into the definition of the softmax function (Hinton, Vinyals, and Dean,
2015) so that

$$y_i = \frac{\exp(a_i/T)}{\sum_j \exp(a_j/T)} \qquad (12.35)$$

and then sample the next token from this modified distribution. When T = 0, the
probability mass is concentrated on the most probable state, with all other states
having zero probability, and hence this becomes greedy selection. For T = 1, we
recover the unmodified softmax distribution, and as $T \to \infty$, the distribution becomes
uniform across all states. By choosing a value in the range 0 < T < 1, the
probability is concentrated towards the higher values.
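
The effect of the temperature in (12.35) is easy to see numerically. The sketch below evaluates the modified softmax for an arbitrary, hypothetical vector of pre-activations at a low, a unit, and a high temperature. It requires T > 0, so the greedy case is approached as a limit rather than evaluated at T = 0.

```python
import numpy as np

def softmax_with_temperature(a, T):
    """Evaluate (12.35) for pre-activations a and temperature T > 0."""
    z = a / T
    z = z - z.max()          # stabilizes the exponentials without changing the result
    e = np.exp(z)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.0])                  # hypothetical pre-activations
sharp = softmax_with_temperature(a, 0.1)       # nearly all mass on the largest entry
plain = softmax_with_temperature(a, 1.0)       # the unmodified softmax
flat = softmax_with_temperature(a, 100.0)      # close to uniform
```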

One challenge with sequence generation is that during the learning phase, the 
model is trained on a human-generated input sequence, whereas when it is running 
generatively, the input sequence is itself generated from the model. This means that 
the model can drift away from the distribution of sequences seen during training. 


12.3.3 Encoder transformers 


We next consider transformer language models based on encoders, which are 
models that take sequences as input and produce fixed-length vectors, such as class 
labels, as output. An example of such a model is BERT, which stands for bidirec- 
tional encoder representations from transformers (Devlin et al., 2018). The goal is 
to pre-train a language model using a large corpus of text and then to fine-tune the 
model using transfer learning for a broad range of downstream tasks each of which 
requires a smaller application-specific training data set. The architecture of an en- 
coder transformer is illustrated in Figure 12.18. This approach is a straightforward 
application of the transformer layers discussed previously. 

The first token of every input string is given by a special token ⟨class⟩, and the
corresponding output of the model is ignored during pre-training. Its role will become
apparent when we discuss fine-tuning. The model is pre-trained by presenting
token sequences at the input. A randomly chosen subset of the tokens, say 15%, is
replaced with a special token denoted ⟨mask⟩. The model is trained to predict the
missing tokens at the corresponding output nodes. This is analogous to the masking
used in word2vec to learn word embeddings. For example, an input sequence might
be

I ⟨mask⟩ across the river to get to the ⟨mask⟩ bank.


and the network should predict ‘swam’ at output node 2 and ‘other’ at output node 
10. In this case only two of the outputs contribute to the error function and the other 
outputs are ignored. 

The term ‘bidirectional’ refers to the fact that the network sees words both be- 
fore and after the masked word and can use both sources of information to make a 
prediction. As a consequence, unlike decoder models, there is no need to shift the 
inputs to the right by one place, and there is no need to mask the outputs of each layer 
from seeing input tokens occurring later in the sequence. Compared to the decoder 
model, an encoder is less efficient since only a fraction of the sequence tokens are 
used as training labels. Moreover, an encoder model is unable to generate sequences. 

The procedure of replacing randomly selected tokens with ⟨mask⟩ means the
training set has a mismatch compared to subsequent fine-tuning sets in that the latter
will not contain any ⟨mask⟩ tokens. To mitigate any problems this might cause,
Devlin et al. (2018) modified the procedure slightly, so that of the 15% of randomly
selected tokens, 80% are replaced with ⟨mask⟩, 10% are replaced with a word selected
at random from the vocabulary, and in 10% of the cases, the original words
are retained at the input, but they still have to be correctly predicted at the output.

Figure 12.18 Architecture of an encoder transformer model. The boxes labelled ‘LSM’ denote a linear
transformation whose learnable parameters are shared across the token positions, followed by a softmax activation
function. The main differences compared to the decoder model are that the input sequence is not shifted to the
right, and the ‘look ahead’ masking matrix is omitted and therefore, within each self-attention layer, every output
token can attend to any of the input tokens.
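
The modified masking procedure with its 80%, 10%, and 10% split can be sketched as follows. In this illustration each token is selected independently with probability 0.15, rather than an exact 15% subset being drawn, and the string `"<mask>"`, the function name, and the toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return corrupted inputs and a dict {position: original token} of prediction targets."""
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                       # not selected: no contribution to the error
        targets[i] = tok                   # selected: must be predicted at the output
        r = rng.random()
        if r < 0.8:
            inputs[i] = "<mask>"           # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary word
        # remaining 10%: keep the original token at the input
    return inputs, targets

sentence = "I swam across the river to get to the other bank".split()
vocab = sorted(set(sentence))
inputs, targets = mask_tokens(sentence, vocab, random.Random(0))
```

Only the positions recorded in `targets` would contribute to the error function; all other outputs are ignored during pre-training.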


Once the encoder model is trained it can then be fine-tuned for a variety of
different tasks. To do this a new output layer is constructed whose form is specific
to the task being solved. For a text classification task, only the first output position
is used, which corresponds to the ⟨class⟩ token that always appears in the first
position of the input sequence. If this output has dimension D then a matrix of
parameters of dimension D × K, where K is the number of classes, is appended to
the first output node, and this in turn feeds into a K-dimensional softmax function.
For K = 2, a vector of dimension D × 1 followed by a logistic sigmoid can be used
instead. The linear output transformation could alternatively be replaced with a more
complex differentiable model such as an MLP. If the goal is to classify each token of the
input string, for example to assign each token to a category (such as person, place,
colour, etc.), then the first output is ignored and the subsequent outputs have a shared
linear-plus-softmax layer. During fine-tuning all model parameters including the
new output matrix are learned by stochastic gradient descent using the log probability
of the correct label. Alternatively the output of a pre-trained model might feed into a
sophisticated generative deep learning model for applications such as text-to-image
synthesis.


12.3.4 Sequence-to-sequence transformers 


For completeness, we discuss briefly the third category of transformer model, 
which combines an encoder with a decoder, as discussed in the original transformer 
paper of Vaswani et al. (2017). Consider the task of translating an English sentence 
into a Dutch sentence. We can use a decoder model to generate the token sequence 
corresponding to the Dutch output, token by token, as discussed previously. The 
main difference is that this output needs to be conditioned on the entire input se- 
quence corresponding to the English sentence. An encoder transformer can be used 
to map the input token sequence into a suitable internal representation, which we 
denote by Z. To incorporate Z into the generative process for the output sequence, 
we use a modified form of the attention mechanism called cross attention. This is the 
same as self-attention except that although the query vectors come from the sequence 
being generated, in this case the Dutch output sequence, the key and value vectors 
come from the sequence represented by Z, as illustrated in Figure 12.19. Returning 
to our analogy with a video streaming service, this would be like the user sending 
their query vector to a different streaming company who then compares it with their 
own set of key vectors to find the best match and then returns the associated value 
vector in the form of a movie. 
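
Cross attention differs from self-attention only in where the three sets of vectors come from, as the following sketch shows. It assumes a single head with scaled dot-product attention and random weight matrices, and the names `X_dec`, `W_q`, `W_k`, and `W_v` are illustrative.

```python
import numpy as np

def cross_attention(X_dec, Z, W_q, W_k, W_v):
    """Queries come from the decoder sequence; keys and values come from the encoder output Z."""
    Q, K, V = X_dec @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)    # each output position normalizes over the inputs
    return w @ V, w

rng = np.random.default_rng(0)
D = 8
Z = rng.normal(size=(7, D))        # encoder representation of a 7-token input sentence
X_dec = rng.normal(size=(4, D))    # decoder vectors for 4 output tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
out, weights = cross_attention(X_dec, Z, W_q, W_k, W_v)
```

The weight matrix has one row per output token and one column per input token, so the two sequences are free to have different lengths. No look-ahead mask is applied here because every output token may attend to the whole of the input sentence.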

When we combine the encoder and decoder modules, we obtain the architecture 
of the model shown in Figure 12.20. The model can be trained using paired input 
and output sentences. 


12.3.5 Large language models 


The most important recent development in the field of machine learning has 
been the creation of very large transformer-based neural networks for natural lan- 
guage processing, known as large language models or LLMs. Here ‘large’ refers to 
the number of weight and bias parameters in the network, which can number up to 
around one trillion ($10^{12}$) at the time of writing. Such models are expensive to train,
and the motivation for building them comes from their extraordinary capabilities. 

In addition to the availability of large data sets, the training of ever larger mod- 
els has been facilitated by the advent of massively parallel training hardware based 
on GPUs (graphics processing units) and similar processors tightly coupled in large 
clusters equipped with fast interconnect and lots of onboard memory. The transformer
architecture has played a key role in the development of these models because it is 
able to make very efficient use of such hardware. Very often, increasing the size of 
the training data set, along with a commensurate increase in the number of model pa- 
rameters, leads to improvements in performance that outpace architectural improve- 
ments or other ways to incorporate more domain knowledge (Sutton, 2019; Kaplan 
et al., 2020). For example, the impressive increase in performance of the GPT se- 
ries of models (Radford et al., 2019; Brown et al., 2020; OpenAI, 2023) through 
successive generations has come primarily from an increase in scale. These kinds 
of performance improvements have driven a new kind of Moore’s law in which the
number of compute operations required to train a state-of-the-art machine learning 
model has grown exponentially since about 2012 with a doubling time of around 3.4 
months. 
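
To get a feel for what such a doubling time implies, the short calculation below converts it into a growth factor over a given period. The function name is ours, and the only input taken from the text is the 3.4-month doubling time.

```python
def growth_factor(months, doubling_time_months=3.4):
    """Multiplicative growth in training compute after `months` months."""
    return 2.0 ** (months / doubling_time_months)

per_year = growth_factor(12)
print(f"Compute grows by a factor of about {per_year:.1f} per year")
```

Under this assumption the compute requirement grows by roughly an order of magnitude each year.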

Early language models were trained using supervised learning. For example, to 
build a translation system, the training set would consist of matched pairs of sen- 
tences in two languages. A major limitation of supervised learning, however, is 
that the data typically has to be human-curated to provide labelled examples, and 
this severely limits the quantity of data available, thereby requiring heavy use of 
inductive biases such as feature engineering and architecture constraints to achieve 
reasonable performance. 

Large language models are trained instead by self-supervised learning on very 
large data sets of text, along with potentially other token sequences such as computer 
code. We have seen how a decoder transformer can be trained on token sequences 
in which each token acts as a labelled target example, with the preceding sequence 
as input, to learn a conditional probability distribution. This ‘self-labelling’ hugely 
expands the quantity of training data available and therefore allows exploitation of 
deep neural networks having large numbers of parameters. 
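
The ‘self-labelling’ idea can be sketched in a few lines. The helper below is illustrative rather than taken from any particular library. It turns a single token sequence into input and target pairs and, for simplicity, skips the first token, which in practice would be predicted from a start token.

```python
def next_token_examples(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:n], tokens[n]) for n in range(1, len(tokens))]

sequence = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in next_token_examples(sequence):
    print(context, "->", target)
```

A sequence of N tokens therefore yields N - 1 labelled examples without any human annotation.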

Figure 12.20 Schematic illustration of a sequence-to-sequence transformer. To keep the diagram uncluttered
the input tokens are collectively shown as a single box, and likewise for the output tokens. Positional-encoding
vectors are added to the input tokens for both the encoder and decoder sections. Each layer in the encoder
corresponds to the structure shown in Figure 12.9, and each cross-attention layer is of the form shown in
Figure 12.19.


This use of self-supervised learning led to a paradigm shift in which a large 
model is first pre-trained using unlabelled data and then subsequently fine-tuned 
using supervised learning based on a much smaller set of labelled data. This is 
effectively a form of transfer learning, and the same pre-trained model can be used 
for multiple ‘downstream’ applications. A model with broad capabilities that can be 
subsequently fine-tuned for specific tasks is called a foundation model (Bommasani 
et al., 2021). 

The fine-tuning can be done by adding extra layers to the outputs of the network 
or by replacing the last few layers with fresh parameters and then using the labelled 
data to train these final layers. During the fine-tuning stage, the weights and biases 
in the main model can either be left unchanged or be allowed to undergo small levels 
of adaptation. Typically the cost of the fine-tuning is small compared to that of pre- 
training. 

Figure 12.21 Schematic illustration of low-rank adaptation showing a weight matrix $\mathbf{W}_0$ from one of the
attention layers in a pre-trained transformer. Additional weights given by matrices $\mathbf{A}$ and
$\mathbf{B}$ are adapted during fine-tuning and their product $\mathbf{A}\mathbf{B}$ is then added to the original matrix
for subsequent inference.

One very efficient approach to fine-tuning is called low-rank adaptation or LoRA
(Hu et al., 2021). This approach is inspired by results which show that a trained
over-parameterized model has a low intrinsic dimensionality with respect to fine-tuning,
meaning that changes in the model parameters during fine-tuning lie on a manifold
whose dimensionality is much smaller than the total number of learnable parameters
in the model (Aghajanyan, Zettlemoyer, and Gupta, 2020). LoRA exploits this by
freezing the weights of the original model and adding additional learnable weight 
matrices into each layer of the transformer in the form of low-rank products. Typi- 
cally only attention-layer weights are modified, whereas MLP-layer weights are kept 
fixed. Consider a weight matrix $\mathbf{W}_0$ having dimension $D \times D$, which might represent
a query, key, or value matrix in which the matrices from multiple attention
heads are treated together as a single matrix. We introduce a parallel set of weights
defined by the product of two matrices $\mathbf{A}$ and $\mathbf{B}$ with dimensions $D \times R$ and $R \times D$,
respectively, as shown schematically in Figure 12.21. This layer then generates an
output given by $\mathbf{X}\mathbf{W}_0 + \mathbf{X}\mathbf{A}\mathbf{B}$. The number of parameters in the additional weight
matrix $\mathbf{A}\mathbf{B}$ is $2RD$ compared to the $D^2$ parameters in the original weight matrix
$\mathbf{W}_0$, and so if $R \ll D$ then the number of parameters that need to be adapted during
fine-tuning is much smaller than the number in the original transformer. In practice,
this can reduce the number of parameters that need to be trained by a factor of
10,000. Once the fine-tuning is complete, the additional weights can be added to the
original weight matrices to give a new weight matrix


$$\widehat{\mathbf{W}} = \mathbf{W}_0 + \mathbf{A}\mathbf{B} \qquad (12.36)$$


so that during inference there is no additional computational overhead compared to 
running the original model since the updated model has the same size as the original. 
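
The following sketch, written in plain Python with small hypothetical matrices, checks two of the claims above: that merging the low-rank product into the original weights gives the same output as the two-branch computation, and that the number of trainable parameters per matrix falls from D^2 to 2RD. The values D = 4096 and R = 8 are illustrative assumptions, not figures from any specific model.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def add(P, Q):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def lora_param_counts(D, R):
    """Trainable parameters: full D x D matrix versus factors A (D x R) and B (R x D)."""
    return D * D, 2 * R * D

# Toy example with D = 3 and R = 1.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen pre-trained weights
A = [[1.0], [0.0], [2.0]]                                  # D x R, adapted in fine-tuning
B = [[0.5, 0.0, 1.0]]                                      # R x D, adapted in fine-tuning
X = [[1.0, 2.0, 3.0]]                                      # one input row vector

two_branch = add(matmul(X, W0), matmul(matmul(X, A), B))   # X W0 + X A B
W_merged = add(W0, matmul(A, B))                           # W0 + A B
merged = matmul(X, W_merged)

full, lora = lora_param_counts(D=4096, R=8)
print(two_branch, merged, full // lora)
```

For a single matrix the reduction factor is D/(2R). Larger overall savings arise because many of the weights in the network, such as those in the MLP layers, are not adapted at all.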
As language models have become larger and more powerful, the need for fine- 
tuning has diminished, with generative language models now able to solve a broad 
range of tasks simply through text-based interaction. For example, if a text string 


English: the cat sat on the mat. French: 


is given as the input sequence, an autoregressive language model can continue to generate
subsequent tokens until a ⟨stop⟩ token is generated, in which case the newly generated
tokens represent the French translation. Note that the model was not trained
specifically to do translation but has learned to do so as a result of being trained on a 
vast corpus of data that includes multiple languages. 

A user can interact with such models using a natural language dialogue, mak- 
ing them very accessible to broad audiences. To improve the user experience and 
the quality of the generated outputs, techniques have been developed for fine-tuning 
large language models through human evaluation of generated output, using methods 
such as reinforcement learning from human feedback or RLHF (Christiano et al.,
2017). Such techniques have helped to create large language models with impres- 
sively easy-to-use conversational interfaces, most notably the system from OpenAI 
called ChatGPT. 

The sequence of input tokens given by the user is called a prompt. For example, 
it might consist of the opening words of a story, which the model is required to com- 
plete. Or it might comprise a question, and the model should provide the answer. By 
using different prompts, the same trained neural network may be capable of solving a 
broad range of tasks such as generating computer code from a simple text request or 
writing rhyming poetry on demand. The performance of the model now depends on 
the form of the prompt, leading to a new field called prompt engineering (Liu et al., 
2021), which aims to design a good form for a prompt that results in high-quality 
output for the downstream task. The behaviour of the model can also be modified by
adapting the user’s prompt before feeding it into the language model, by prepending
an additional token sequence called a prefix prompt to the user prompt to modify
the form of the output. For example, the prefix prompt might consist of instructions,
expressed in standard English, to tell the network not to include offensive language
in its output.

Prompting also allows the model to solve new tasks simply by providing some examples
within the prompt, without needing to adapt the parameters of the model. This is an
example of few-shot learning.
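
As an illustration, a few-shot prompt for the translation task considered earlier might be assembled as follows. The formatting convention and the example sentence pairs are hypothetical, and in practice the resulting string would be tokenized and passed to the language model.

```python
def build_few_shot_prompt(examples, query):
    """Format worked examples followed by an unanswered query."""
    lines = [f"English: {en} French: {fr}" for en, fr in examples]
    lines.append(f"English: {query} French:")
    return "\n".join(lines)

examples = [("the dog barked.", "le chien a aboyé."),
            ("I like cheese.", "j'aime le fromage.")]
prompt = build_few_shot_prompt(examples, "the cat sat on the mat.")
print(prompt)
```

The model parameters are untouched, and all of the task-specific information is carried by the prompt itself.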

Current state-of-the-art models such as GPT-4 have become so powerful that 
they are exhibiting remarkable properties which have been described as the first in- 
dications of artificial general intelligence (Bubeck et al., 2023) and are driving a 
new wave of technological innovation. Moreover, the capabilities of these models 
continue to improve at an impressive pace. 


12.4 Multimodal Transformers


Although transformers were initially developed as an alternative to recurrent net- 
works for processing sequential language data, they have become prevalent in nearly 
all areas of deep learning. They have proved to be general-purpose models, as they 
make very few assumptions about the input data, in contrast, for example, to convo- 
lutional networks, which make strong assumptions about equivariances and locality. 
Due to their generality, transformers have become the state-of-the-art for many dif- 
ferent modalities, including text, image, video, point cloud, and audio data, and have 
been used for both discriminative and generative applications within each of these 
domains. The core architecture of the transformer layer has remained relatively
constant, both over time and across applications. Therefore, the key innovations that
enabled the use of transformers in areas other than natural language have largely 
focused on the representation and encoding of the inputs and outputs. 

One big advantage of a single architecture that is capable of processing many 
different kinds of data is that it makes multimodal computation relatively straight- 
forward. In this context, multimodal refers to applications that combine two or more 
different types of data, either in the inputs or outputs or both. For example, we
may wish to generate an image from a text prompt or design a robot that can com- 
bine information from multiple sensors such as cameras, radar, and microphones. 
The important thing to note is that if we can tokenize the inputs and decode the 
output tokens, then it is likely that we can use a transformer. 


12.4.1 Vision transformers 


Transformers have been applied with great success to computer vision and have 
achieved state-of-the-art performance on many tasks. The most common choice for 
discriminative tasks is a standard transformer encoder, and this approach in the vi- 
sion domain is known as a vision transformer, or ViT (Dosovitskiy et al., 2020). 
When using a transformer, we need to decide how to convert an input image into 
tokens, and the simplest choice is to use each pixel as a token, following a linear 
projection. However, the memory required by a standard transformer implementa- 
tion grows quadratically with the number of input tokens, and so this approach is 
generally infeasible. Instead, the most common approach to tokenization is to split 
the image into a set of patches of the same size. Suppose the images have dimension
$\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ where $H$ and $W$ are the height and width of the image in pixels
and $C$ is the number of channels (where typically $C = 3$ for R, G, and B colours).
Each image is split into non-overlapping patches of size $P \times P$ (where $P = 16$ is
a common choice) and each patch is then ‘flattened’ into a one-dimensional vector, which gives a
representation $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 C)}$ where $N = HW/P^2$ is the total number of patches
for one image. The ViT architecture is shown in Figure 12.22.
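
The patch tokenization can be sketched as follows, using nested Python lists to stand in for an image array. The function name and the tiny image size are illustrative assumptions, and a real implementation would use an array library and follow the flattening with a learned linear projection.

```python
def patchify(image, P):
    """Split an H x W x C image (nested lists) into non-overlapping P x P patches,
    each flattened to a vector of length P * P * C."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            patch = [channel
                     for i in range(top, top + P)
                     for j in range(left, left + P)
                     for channel in image[i][j]]
            patches.append(patch)
    return patches

# Toy 4 x 4 RGB image with P = 2 (a real ViT would typically use P = 16).
H, W, C, P = 4, 4, 3, 2
image = [[[float(i * W + j)] * C for j in range(W)] for i in range(H)]
tokens = patchify(image, P)
print(len(tokens), len(tokens[0]))
```

Here the number of tokens is HW/P^2 = 4 and each token has P^2 C = 12 elements, in agreement with the dimensions given above.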

Another approach to tokenization is to feed the image through a small convolu- 
tional neural network (CNN). This can down-sample the image to give a manageable 
number of tokens each represented by one of the network outputs. For example, a typical
ResNet18 encoder architecture down-samples an image by a factor of 8 in both
the height and width dimensions, giving 64 times fewer tokens than pixels. 

We also need a way to encode positional information in the tokens. It is pos- 
sible to construct explicit positional embeddings that encode the two-dimensional 
positional information of the image patches, but in practice this does not generally 
improve performance, and so it is most common to just use learned positional em- 
beddings. In contrast to the transformers used for natural language, vision trans- 
formers generally take a fixed number of tokens as input, which avoids the problem 
of learned positional encodings not generalizing to inputs of a different size. 

Figure 12.22 Illustration of the vision transformer architecture for a classification task. Here a learnable ⟨class⟩
token is included as an additional input, and the associated output is transformed by a linear layer with a softmax
activation, denoted by LSM, to give the final class-vector output c.

A vision transformer has a very different architectural design compared to a
CNN. Although strong inductive biases are baked into a CNN model, the only
two-dimensional inductive bias in a vision transformer is due to the patches used to
tokenize the input. A transformer therefore generally requires more training data than a
comparable CNN as it has to learn the geometrical properties of images from scratch. 
However, because there are no strong assumptions about the structure of the inputs, 
transformers are often able to converge to a higher accuracy. This provides another 
illustration of the trade-off between inductive bias and the scale of the training data 
(Sutton, 2019). 


12.4.2 Generative image transformers 


In the language domain, the most impressive results have come when trans- 
formers are used as an autoregressive generative model for synthesizing text. It is 
therefore natural to ask whether we can also use transformers to synthesize realistic 
images. Since natural language is inherently sequential, it fits neatly into the au- 
toregressive framework, whereas images have no natural ordering of their pixels so 
that it is not as intuitive that decoding them autoregressively would be useful. How- 
ever, any distribution can be decomposed into a product of conditionals, provided we 
first define some ordering of the variables. Thus, the joint distribution over ordered
variables $\mathbf{x}_1, \ldots, \mathbf{x}_N$ can be written

$$p(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1}). \qquad (12.37)$$

This factorization is completely general and makes no restrictions on the form of the
individual conditional distributions $p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$.

Figure 12.23 Illustration of a raster scan that defines a specific linear
ordering of the pixels in a two-dimensional image.

For an image we can choose $\mathbf{x}_n$ to represent the $n$th pixel as a three-dimensional
vector of the RGB values. We now need to decide on an ordering for the pixels, and 
one widely used choice is called a raster scan as illustrated in Figure 12.23. A 
schematic illustration of an image being generated using an autoregressive model, 
based on a raster-scan ordering, is shown in Figure 12.24. 
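
The sampling procedure can be sketched as follows. To keep the example short we assume binary pixels rather than RGB vectors, and we use a hand-written toy conditional in place of a trained network. Both are illustrative assumptions.

```python
import random

def sample_image(height, width, conditional, rng):
    """Sample pixels one at a time in raster-scan order (row by row, left to right).
    `conditional(previous)` returns P(pixel = 1 | all previously sampled pixels)."""
    pixels = []                      # flat list in raster order
    for n in range(height * width):
        p_one = conditional(pixels)  # depends only on pixels earlier in the ordering
        pixels.append(1 if rng.random() < p_one else 0)
    # Reshape the flat sequence back into rows.
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

def toy_conditional(previous):
    # Hypothetical stand-in for a trained model: tend to repeat the last pixel.
    if not previous:
        return 0.5
    return 0.9 if previous[-1] == 1 else 0.1

image = sample_image(4, 6, toy_conditional, random.Random(0))
for row in image:
    print(row)
```

Each pixel is drawn from a distribution conditioned only on pixels that precede it in the raster ordering, which is exactly the structure of the factorization of the joint distribution into conditionals.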

Note that the use of autoregressive generative models of images predates the 
introduction of transformers. For example, PixelCNN (Oord et al., 2016) and Pixel- 
RNN (Oord, Kalchbrenner, and Kavukcuoglu, 2016) used bespoke masked convolu- 
tion layers that preserve the conditional independence defined for each pixel by the 
corresponding term on the right-hand side of (12.37).

Figure 12.24 An illustration of how an image can be sampled from an autoregressive model. The first pixel is
sampled from the marginal distribution $p(\mathbf{x}_{11})$, the second pixel from the conditional distribution $p(\mathbf{x}_{12} \mid \mathbf{x}_{11})$, and
so on in raster scan order until we have a complete image.

Representations of an image using continuous values can work well in discriminative
tasks. However, much better results are obtained for image generation by
using discrete representations. Continuous conditional distributions learned by maximum
likelihood, such as Gaussians for which the negative log likelihood function
is a sum-of-squares error function, tend to learn averages of the training data, leading
to blurry images. Conversely, discrete distributions can handle multimodality
with ease. For example, one of the conditional distributions $p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$ in
(12.37) might learn that a pixel could be either black or white, whereas a regression
model might learn that the pixel should be grey.
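
A small numerical illustration makes the point concrete. The training targets below are hypothetical: a single pixel that is black (0) half the time and white (255) otherwise.

```python
from collections import Counter

# Hypothetical training targets for one pixel.
targets = [0, 255, 0, 255, 255, 0, 0, 255]

# The constant prediction that minimizes the sum-of-squares error is the mean,
# which here is a grey value that never occurs in the data.
regression_prediction = sum(targets) / len(targets)

# A discrete (categorical) model instead estimates a probability for each value.
counts = Counter(targets)
categorical = {value: count / len(targets) for value, count in counts.items()}

print(regression_prediction)
print(categorical)
```

Sampling from the categorical distribution always returns black or white, whereas the regression output corresponds to neither mode.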

However, working with discrete spaces also brings its challenges. The R, G, and
B values of image pixels are typically represented with at least 8 bits of precision,
so that each pixel has $2^{24} \approx 16\text{M}$ possible values. Learning a conditional softmax
distribution over such a high-dimensional space is infeasible.

One way to address the problem of the high dimensionality is to use the tech- 
nique of vector quantization, which can be viewed as a form of data compression. 
Suppose we have a set of data vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ each of dimensionality $D$, which
might, for example, represent image pixels, and we then introduce a set of $K$ codebook
vectors $\mathcal{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$ also of dimensionality $D$, where typically $K \ll D$.
We now approximate each data vector by its nearest codebook vector according to
some similarity metric, usually Euclidean distance, so that

$$\mathbf{x}_n \to \operatorname*{arg\,min}_{\mathbf{c}_k \in \mathcal{C}} \|\mathbf{x}_n - \mathbf{c}_k\|^2. \qquad (12.38)$$

Since there are $K$ codebook vectors, we can represent each $\mathbf{x}_n$ by a one-hot encoded
$K$-dimensional vector, and since we can choose the value of $K$, we can control the
trade-off between more accurate representation of the data, by using a larger value
of $K$, or greater compression, by using a smaller value of $K$.
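
Vector quantization by nearest codebook vector can be sketched in a few lines. The codebook of three colours and the pixel values are hypothetical, and in practice the codebook would be learned from data.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def quantize(x, codebook):
    """Return the index k of the codebook vector nearest to x."""
    return min(range(len(codebook)), key=lambda k: squared_distance(x, codebook[k]))

def one_hot(k, K):
    """One-hot encoding of index k as a K-dimensional vector."""
    return [1 if i == k else 0 for i in range(K)]

# Hypothetical codebook of K = 3 colours, each of dimensionality D = 3.
codebook = [[0, 0, 0], [255, 255, 255], [255, 0, 0]]
pixels = [[10, 5, 0], [250, 240, 255], [200, 30, 20]]

indices = [quantize(x, codebook) for x in pixels]
tokens = [one_hot(k, len(codebook)) for k in indices]
reconstruction = [codebook[k] for k in indices]
print(indices, reconstruction)
```

The indices play the role of discrete tokens, and the reconstruction shows the lossy mapping back to the original space.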

We can therefore take the original image pixels and map them into the lower- 
dimensional codebook space. An autoregressive transformer can then be trained to 
generate a sequence of codebook vectors, and this sequence can be mapped back into 
the original image space by replacing each codebook index k with the corresponding 
$D$-dimensional codebook vector $\mathbf{c}_k$.

Autoregressive transformers were first applied to images in ImageGPT (Chen, 
Radford, et al., 2020). Here each pixel is treated as one of a discrete set of three- 
dimensional colour codebook vectors, each corresponding to a cluster in a K-means 
clustering of the colour space. A one-hot encoding therefore gives discrete tokens, 
analogous to language tokens, and allows the transformer to be trained in the same 
way as language models, with a next-token classification objective. This is a power- 
ful objective for representation learning for subsequent fine-tuning, again in a similar 
way to language modelling. 

Figure 12.25 An example mel spectrogram of a humpback whale song. [Source data copyright ©2013-2023,
librosa development team.]

Using the individual pixels as tokens directly, however, can lead to high computational
cost since a forward pass is required per pixel, which means that both
training and inference scale poorly with image resolution. Also, using individual
pixels as inputs means that low-resolution images have to be used to give a reasonable
context length when decoding the pixels later in the raster scan. As we saw with
the ViT model, it is preferable to use patches of the image as tokens instead of pixels,
as this can result in dramatically fewer tokens and therefore facilitates working with
higher-resolution images. As before, we need to work with a discrete space of token
values due to the potential multimodality of the conditional distributions. Again, this
raises the challenge of dimensionality, which is now much more severe with patches
than with individual pixels since the dimensionality is exponential with respect to
the number of pixels in the patch. For example, even with just two possible pixel
tokens, representing black and white, and patches of size $16 \times 16$, we would have a
dictionary of patch tokens of size $2^{256} \approx 10^{77}$.
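
These dictionary sizes are easy to verify with exact integer arithmetic. The helper name below is ours.

```python
import math

def vocabulary_size(values_per_pixel, pixels_per_token):
    """Number of distinct tokens when a token is a block of pixels."""
    return values_per_pixel ** pixels_per_token

rgb_pixel = vocabulary_size(2 ** 8, 3)       # one pixel with three 8-bit channels
binary_patch = vocabulary_size(2, 16 * 16)   # black/white patch of 16 x 16 pixels

print(rgb_pixel)  # 16777216
print(f"about 10^{math.log10(binary_patch):.0f}")
```

The exponential dependence on the number of pixels per token is what makes a direct softmax over raw patches infeasible.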

Once again we turn to vector quantization to address the challenge of dimension- 
ality. The codebook vectors can be learned from a data set of image patches using 
simple clustering algorithms such as K-means or with more sophisticated methods
such as fully convolutional networks (Oord, Vinyals, and Kavukcuoglu, 2017;
Esser, Rombach, and Ommer, 2020) or even vision transformers (Yu et al., 2021). 
One problem with learning to map each patch to a discrete set of codes and back
again is that vector quantization is a non-differentiable operation. Fortunately, we
can use a technique called straight-through gradient estimation (Bengio, Léonard, 
and Courville, 2013), which is a simple approximation that just copies the gradients 
through the non-differentiable function during backpropagation. 
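
The idea can be illustrated with a scalar example in which the gradients are written out by hand. The codebook, encoder output, target, and learning rate are all hypothetical, and a practical implementation would rely on an automatic differentiation framework instead.

```python
def nearest(z, codebook):
    """Scalar vector quantization: return the codebook value closest to z."""
    return min(codebook, key=lambda c: (z - c) ** 2)

def forward_backward(z, target, codebook):
    """Forward pass through the quantizer, then a straight-through backward pass
    for the loss (q - target)^2."""
    q = nearest(z, codebook)       # non-differentiable step
    loss = (q - target) ** 2
    grad_q = 2.0 * (q - target)    # dLoss/dq, which is well defined
    grad_z = grad_q                # straight-through: copy the gradient to z
    return loss, grad_z

codebook = [0.0, 1.0, 2.0]         # hypothetical scalar codes
z = 0.8                            # hypothetical encoder output
loss, grad_z = forward_backward(z, target=1.5, codebook=codebook)
z_updated = z - 0.1 * grad_z       # one gradient-descent step on the encoder output
print(loss, grad_z, z_updated)
```

Although the quantizer itself has zero gradient almost everywhere, the copied gradient still moves the encoder output in a direction that reduces the loss.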

The use of autoregressive transformers to generate images can be extended to 
videos by treating a video as one long sequence of these vector-quantized tokens 
(Rakhimov et al., 2020; Yan et al., 2021; Hu et al., 2023). 


12.4.3 Audio data 


We next look at the application of transformers to audio data. Sound is gener- 
ally stored as a waveform obtained by measuring the amplitude of the air pressure 
at regular time intervals. Although this waveform could be used directly as input to 
a deep learning model, in practice it is more effective to pre-process it into a mel 
spectrogram. This is a matrix whose columns represent time steps and whose rows 
correspond to frequencies. The frequency bands follow a standard convention that 
was chosen through subjective assessment to give equal perceptual differences be- 
tween successive frequencies (the word ‘mel’ comes from melody). An example of 
a mel spectrogram is shown in Figure 12.25. 
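
To give a sense of how such frequency bands are spaced, the sketch below uses one widely used formula for the mel scale, m = 2595 log10(1 + f/700). This particular formula is an assumption on our part, since several conventions exist and software libraries differ in their defaults.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centres(f_min, f_max, n_bands):
    """Centre frequencies (Hz) that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands - 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands)]

centres = mel_band_centres(0.0, 8000.0, 10)
gaps = [b - a for a, b in zip(centres, centres[1:])]
print([round(c) for c in centres])
```

Bands that are equally spaced in mels become progressively wider in Hz, reflecting the coarser perceptual resolution at high frequencies.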

One application for transformers in the audio domain is classification in which 
segments of audio are assigned to one of a number of predefined categories. For 
example, the AudioSet data set (Gemmeke et al., 2017) is a widely used benchmark. 
It contains classes such as ‘car’, ‘animal’, and ‘laughter’. Until the development of
the transformer, the state-of-the-art approach for audio classification was based on 
mel spectrograms treated as images and used as the input to a convolutional neural 
network (CNN). However, although a CNN is good at understanding local relation- 
ships, one drawback is that it struggles with longer-range dependencies, which can 
be important in processing audio. 

Just as transformers replaced RNNs as the state-of-the-art in natural language 
processing, they have also come to replace CNNs for tasks such as audio classifica- 
tion. For example, a transformer encoder model of identical structure to that used 
for both language and vision, as shown in Figure 12.18, can be used to predict the 
class of audio inputs (Gong, Chung, and Glass, 2021). Here the mel spectrogram 
is viewed as an image which is then tokenized. This is done by splitting the image 
into patches in a similar way to vision transformers, possibly with some overlap so 
as not to lose any important neighbourhood relations. Each patch is then flattened, 
meaning it is converted to a one-dimensional array, in this case of length 256. A 
unique positional encoding is then added to each token, a specific ⟨class⟩ token is
appended, and the tokens are then fed through the transformer encoder. The output
token corresponding to the ⟨class⟩ input token from the last transformer layer can
then be decoded using a linear layer followed by a softmax activation function, and 
the whole model can be trained end-to-end using a cross-entropy loss. 
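
The tokenization step can be sketched as follows. The patch size of 16 matches the flattened length of 256 mentioned above, whereas the stride of 10 and the spectrogram dimensions are illustrative assumptions.

```python
def patch_grid(n_mels, n_frames, patch=16, stride=10):
    """Number of patches along frequency and time for a patch x patch window
    moved with the given stride (stride < patch gives overlapping patches)."""
    n_freq = (n_mels - patch) // stride + 1
    n_time = (n_frames - patch) // stride + 1
    return n_freq, n_time

def extract_patches(spec, patch=16, stride=10):
    """Flatten each patch of a spectrogram (list of rows) into a vector."""
    n_freq, n_time = patch_grid(len(spec), len(spec[0]), patch, stride)
    tokens = []
    for a in range(n_freq):
        for b in range(n_time):
            r0, c0 = a * stride, b * stride
            tokens.append([spec[r][c]
                           for r in range(r0, r0 + patch)
                           for c in range(c0, c0 + patch)])
    return tokens

# Hypothetical spectrogram: 128 mel bands by 100 time frames, filled with zeros.
spec = [[0.0] * 100 for _ in range(128)]
tokens = extract_patches(spec)
print(len(tokens), len(tokens[0]))
```

With these settings neighbouring patches share six rows or columns, so that local structure straddling a patch boundary is seen by more than one token.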


12.4.4 Text-to-speech 


Classification is not the only task that deep learning, and more specifically the 
transformer architecture, has revolutionized in the audio domain. The success of 
transformers at synthesizing speech that imitates the voice of a given speaker is an- 
other demonstration of their versatility, and their application to this task is an infor- 
mative case study in how to apply transformers in a new context. 

Generating speech corresponding to a given passage of text is known as
text-to-speech synthesis. A more traditional approach would be to collect recordings
of speech from a given speaker and train a supervised regression model to predict 
the speech output, possibly in the form of a mel spectrogram, from corresponding 
transcribed text. During inference, the text for which we would like to synthesize 
speech is presented as input and the resulting mel spectrogram output can then be 
decoded back to an audio waveform since this is a fixed mapping. 

This approach has a few major drawbacks, however. First, if we predict speech 
at a low level, for example using sub-word components known as phonemes, a 
larger context is needed to make the resulting sentences sound fluid. However, if we 
predict longer segments, then the space of possible inputs grows significantly, and an 
infeasible amount of training data might be required to achieve good generalization. 
Second, this approach does not transfer knowledge across speakers, and so a lot of 
data will be required for each new speaker. Finally, the problem is really a generative 
modelling task, as there are multiple correct speech outputs for a given speaker and 
text pair, so regression may not be suitable since it tends to average over target values. 

Figure 12.26 A diagram showing the high-level architecture of Vall-E. The input to the transformer model consists
of standard text tokens, which prompt the model as to what words the synthesized speech should contain,
together with acoustic prompt tokens that determine the speaker style and tone information. The sampled model
output tokens are decoded back to speech with the learned decoder. For simplicity, the positional encodings and
linear projections are not shown.

If instead we treat audio data in the same way as natural language and frame
text-to-speech as a conditional language modelling task, then we should be able to
train the model in much the same way as with text-based large language models.
There are two main implementation details that need to be addressed. The first is 
how to tokenize the training data and decode the predictions, and the second is how 
to condition the model on the speaker’s voice. 

One approach to text-to-speech synthesis that makes use of transformers and lan- 
guage modelling techniques is Vall-E (Wang et al., 2023). New text can be mapped 
into speech in the voice of a new speaker using only a few seconds of sample speech 
from that person. Speech data is converted into a sequence of discrete tokens from 
a learned dictionary or codebook obtained using vector quantization, and we can 
think of these tokens as analogous to the one-hot encoded tokens from the natural 
language domain. The input consists of text tokens from a passage of text whereas 
the target outputs for training consist of the corresponding speech tokens. Additional 
speech tokens from a short segment of unrelated speech from the same speaker are 
appended to the input text tokens, as illustrated in Figure 12.26. By including examples
from many different speakers, the system can learn to read out a passage of text
while imitating the voice represented by the additional speech input tokens. Once
trained, the system can be presented with new text, along with audio tokens from a
brief segment of speech captured from a new speaker, and the resulting output tokens 
can be decoded, using the same codebook used during training, to create a speech 
waveform. This allows the system to synthesize speech corresponding to the input 
text in the voice of the new speaker. 


12.4.5 Vision and language transformers 


We have seen how to generate discrete tokens for text, audio, and images, and so 
it is a natural next step to ask if we can train a model with input tokens of one modal- 
ity and output tokens of another, or whether we can have a combination of different 
modalities for either inputs or outputs or both. We will focus on the combination 
of text and vision data as this is the most widely studied example, but in principle 
the approaches discussed here could be applied to other combinations of input and 
output modalities. 

The first requirement is that we have a large data set for training. The LAION- 
400M data set (Schuhmann et al., 2021) has greatly accelerated research in text-to- 
image generation and image-to-text captioning in much the same way that ImageNet 
was critical in the development of deep image classification models. Text-to-image 
generation is actually much like the unconditional image generation we have looked 
at so far, except that we also allow the model to take as input the text information to 
condition the generation process. This is straightforward when using transformers as 
we can simply provide the text tokens as additional input when decoding each image 
token. 

This approach can also be viewed as treating the text-to-image problem as a 
sequence-to-sequence language modelling problem, such as machine translation, ex- 
cept that the target tokens are discrete image tokens rather than language tokens. It 
therefore makes sense to choose a full encoder-decoder transformer model, as shown 
in Figure 12.20, in which X corresponds to the input text tokens and Y corresponds 
to the output image tokens. This is the approach taken in a model called Parti (Yu et 
al., 2022) in which the transformer is scaled to 20 billion parameters while showing 
consistent performance improvements with increasing model size. 

A lot of research has also been done on using pre-trained language models, 
and modifying or fine-tuning them so that they can also accept visual data as in- 
put (Alayrac et al., 2022; Li et al., 2022). These approaches largely use bespoke 
architectures, along with continuous-valued image tokens, and therefore are not nat- 
ural fits for also generating visual data. Moreover, they cannot be used directly if we 
wish to include new modalities such as audio tokens. Although this is a step towards 
multimodality, we would ideally like to use both text and image tokens as both input 
and output. The simplest approach is to treat everything as a sequence of tokens as 
if this were natural language but with a dictionary that is the concatenation of a lan- 
guage token dictionary and the image token codebook. We can then treat any stream 
of audio and visual data as simply a sequence of tokens. 
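
One simple way to realize such a concatenated dictionary is to offset the image codebook indices by the size of the text vocabulary. The vocabulary sizes and token ids below are hypothetical.

```python
TEXT_VOCAB_SIZE = 50_000      # hypothetical language-token dictionary size
IMAGE_CODEBOOK_SIZE = 8_192   # hypothetical image codebook size
JOINT_VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE

def image_code_to_token(k):
    """Shift image codebook index k so it does not collide with text token ids."""
    return TEXT_VOCAB_SIZE + k

def is_image_token(token_id):
    """True if the id lies in the image part of the joint dictionary."""
    return token_id >= TEXT_VOCAB_SIZE

# A mixed sequence: some text token ids followed by image codes.
text_ids = [17, 942, 3051]
image_codes = [5, 8191, 0]
sequence = text_ids + [image_code_to_token(k) for k in image_codes]
print(sequence)
```

A single softmax over the joint dictionary then allows the model to emit either kind of token at any position in the sequence.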
[Figure 12.27 panels: text-guided image editing; grounded generation from scribbles and poses; text-to-image generation, including spatially grounded prompts; image-to-text tasks (caption, long caption, VQA, reasoning).]

Figure 12.27 Examples of the CM3Leon model performing a variety of different tasks in the joint space of text
and images. [From (Yu et al., 2023) with permission.]

In CM3 (Aghajanyan et al., 2022) and CM3Leon (Yu et al., 2023), a variation of
language modelling is used to train on HTML documents containing both image and
text data taken from online sources. When this large quantity of training data was
combined with a scalable architecture, the models became very powerful. Moreover,
the multimodal nature of the training means that the models are very flexible. Such
models are capable of completing many tasks that otherwise might require
task-specific model architectures and training regimes, such as text-to-image generation,
image-to-text captioning, image editing, text completion, and many more, including
anything a regular language model is capable of. Examples of the CM3Leon model
completing instances of a few different tasks are shown in Figure 12.27.


Exercises

12.1 (⋆⋆) Consider a set of coefficients $a_{nm}$, for $m = 1, \ldots, N$, with the properties that

$$a_{nm} \ge 0 \tag{12.39}$$

$$\sum_{m=1}^{N} a_{nm} = 1. \tag{12.40}$$

By using a Lagrange multiplier show that the coefficients must also satisfy

$$a_{nm} \le 1 \quad \text{for } n = 1, \ldots, N. \tag{12.41}$$

12.2 (⋆) Verify that the softmax function (12.5) satisfies the constraints (12.3) and (12.4)
for any values of the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$.

12.3 (⋆) Consider the input vectors $\mathbf{x}_n$ in the simple transformation defined by (12.2), in
which the weighting coefficients $a_{nm}$ are defined by (12.5). Show that if all the input
vectors are orthogonal, so that $\mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m = 0$ for $n \ne m$, then the output vectors will
simply be equal to the input vectors so that $\mathbf{y}_n = \mathbf{x}_n$ for $n = 1, \ldots, N$.

12.4 (⋆) Consider two independent random vectors $\mathbf{a}$ and $\mathbf{b}$ each of dimension $D$ and
each being drawn from a Gaussian distribution with zero mean and unit variance
$\mathcal{N}(\cdot \mid \mathbf{0}, \mathbf{I})$. Show that the expected value of $(\mathbf{a}^{\mathrm{T}} \mathbf{b})^2$ is given by $D$.

12.5 (⋆⋆⋆) Show that multi-head attention defined by (12.19) can be rewritten in the form

$$\mathbf{Y} = \sum_{h=1}^{H} \mathbf{H}_h \mathbf{X} \mathbf{W}^{(h)} \tag{12.42}$$

where $\mathbf{H}_h$ is given by (12.15) and we have defined

$$\mathbf{W}^{(h)} = \mathbf{W}_h^{(\mathrm{v})} \mathbf{W}_h^{(\mathrm{o})}. \tag{12.43}$$

Here we have partitioned the matrix $\mathbf{W}^{(\mathrm{o})}$ horizontally into sub-matrices denoted
$\mathbf{W}_h^{(\mathrm{o})}$, each of dimension $D_{\mathrm{v}} \times D$, corresponding to the vertical segments of the
concatenated attention matrix. Since $D_{\mathrm{v}}$ is typically smaller than $D$, for example
$D_{\mathrm{v}} = D/H$ is a common choice, this combined matrix is rank deficient. Therefore,
using a fully flexible matrix to replace $\mathbf{W}_h^{(\mathrm{v})} \mathbf{W}_h^{(\mathrm{o})}$ would not be equivalent to the
original formulation given in the text.

12.6 (⋆⋆) Express the self-attention function (12.14) as a fully connected network in the
form of a matrix that maps the full input sequence of concatenated word vectors
into an output vector of the same dimension. Note that such a matrix would have
$\mathcal{O}(N^2 D^2)$ parameters. Show that the self-attention network corresponds to a sparse
version of this matrix with parameter sharing. Draw a sketch showing the structure
of this matrix, indicating which blocks of parameters are shared and which blocks
have all elements equal to zero.

12.7 (⋆) Show that if we omit the positional encoding of input vectors then the outputs
of a multi-head attention layer defined by (12.19) are equivariant with respect to a
reordering of the input sequence.

12.8 (⋆⋆⋆) Consider two $D$-dimensional unit vectors $\mathbf{a}$ and $\mathbf{b}$, satisfying $\|\mathbf{a}\| = 1$ and
$\|\mathbf{b}\| = 1$, drawn from a random distribution. Assume that the distribution is
symmetrical around the origin, i.e., it depends only on the distance from the origin and
not the direction. Show that for large values of $D$ the magnitude of the cosine of the
angle between these vectors is close to zero and hence that these random vectors are
nearly orthogonal in a high-dimensional space. To do this, consider an orthonormal
basis set $\{\mathbf{u}_i\}$ where $\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}$ and express $\mathbf{a}$ and $\mathbf{b}$ as expansions in this basis.

12.9 (⋆⋆) Consider a position encoding in which the input token vector $\mathbf{x}$ is concatenated
with a position-encoding vector $\mathbf{e}$. Show that when this concatenated vector
undergoes a general linear transformation by multiplication using a matrix, the result can
be expressed as the sum of a linearly transformed input and a linearly transformed
position vector.

12.10 (⋆⋆) Show that the positional encoding defined by (12.25) has the property that,
for a fixed offset $k$, the encoding at position $n + k$ can be represented as a linear
combination of the encoding at position $n$ with coefficients that depend only on $k$
and not on $n$. To do this make use of the following trigonometric identities:

$$\cos(A + B) = \cos A \cos B - \sin A \sin B \tag{12.44}$$

$$\sin(A + B) = \cos A \sin B + \sin A \cos B. \tag{12.45}$$

Show that if the encoding is based purely on sine functions, without cosine functions,
then this property no longer holds.

12.11 (⋆) Consider the bag-of-words model (12.28) in which each of the component
distributions $p(\mathbf{x}_n)$ is given by a general probability table that is shared across all words.
Show that the maximum likelihood solution, given a training set of vectors, is given
by a table whose entries are the fractions of times each word occurs in the training
set.

12.12 (⋆) Consider the autoregressive language model given by (12.31) and suppose that
the terms $p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$ on the right-hand side are represented by general
probability tables. Show that the number of entries in these tables grows
exponentially with the value of $n$.

12.13 (⋆) When using n-grams it is usual to train the $n$-gram and $(n - 1)$-gram models at
the same time and then compute the conditional probability using the product rule of
probability in the form

$$p(\mathbf{x}_n \mid \mathbf{x}_{n-L+1}, \ldots, \mathbf{x}_{n-1}) = \frac{p_L(\mathbf{x}_{n-L+1}, \ldots, \mathbf{x}_n)}{p_{L-1}(\mathbf{x}_{n-L+1}, \ldots, \mathbf{x}_{n-1})}. \tag{12.46}$$

Explain why this is more convenient than storing the left-hand side directly, and
show that to obtain the correct probabilities the final token from each sequence must
be omitted when evaluating $p_{L-1}(\cdots)$.

12.14 (⋆⋆) Write down pseudo-code for the inference process in a trained RNN with an
architecture of the form depicted in Figure 12.13.

12.15 (⋆⋆) Consider a sequence of two tokens $y_1$ and $y_2$ each of which can take the states
A or B. The table below shows the joint probability distribution $p(y_1, y_2)$:

[Table: joint probabilities $p(y_1, y_2)$ with columns for the states A and B; the numerical entries are not recoverable.]

We see that the most probable sequence is $y_1 = \mathrm{B}$, $y_2 = \mathrm{B}$ and that this has
probability 0.4. Using the sum and product rules of probability, write down the values
of the marginal distribution $p(y_1)$ and the conditional distribution $p(y_2 \mid y_1)$. Show
that if we first maximize $p(y_1)$ to give a value $y_1^{\star}$ and then subsequently maximize
$p(y_2 \mid y_1^{\star})$ then we obtain a sequence that is different from the overall most probable
sequence. Find the probability of the sequence.

12.16 (⋆) The BERT-Large model (Devlin et al., 2018) has a maximum input length of 512
tokens, each of dimensionality $D = 1{,}024$ and taken from a vocabulary of 30,000. It
has 24 transformer layers each with 16 self-attention heads with $D_{\mathrm{q}} = D_{\mathrm{k}} = D_{\mathrm{v}} = 64$,
and the MLP position-wise networks have two layers with 4,096 hidden nodes.
Show that the total number of parameters in the BERT encoder transformer language
model is approximately 340 million.




13 Graph Neural Networks


In previous chapters we have encountered structured data in the form of sequences 
and images, corresponding to one-dimensional and two-dimensional arrays of vari- 
ables respectively. More generally, there are many types of structured data that are 
best described by a graph as illustrated in Figure 13.1. In general a graph consists of 
a set of objects, known as nodes, connected by edges. Both the nodes and the edges 
can have data associated with them. For example, in a molecule the nodes and edges 
are associated with discrete variables corresponding to the types of atom (carbon, ni- 
trogen, hydrogen, etc.) and the types of bonds (single bond, double bond, etc.). For a 
rail network, each railway line might be associated with a continuous variable given 
by the average journey time between two cities. Here we are assuming that the edges 
are symmetrical, for example that the journey time from London to Cambridge is the 
same as the journey time from Cambridge to London. Such edges are depicted by 
undirected links between the nodes. For the worldwide web the edges are directed
since if there is a hyperlink on page A that points to page B there is not necessarily
a hyperlink on page B pointing back to page A.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. M. Bishop, H. Bishop, Deep Learning, https://doi.org/10.1007/978-3-031-45468-4_13

Figure 13.1 Three examples of graph-structured data: (a) the caffeine molecule consisting of atoms connected
by chemical bonds, (b) a rail network consisting of cities connected by railway lines, and (c) the worldwide web
consisting of pages connected by hyperlinks.

Other examples of graph-structured data include a protein interaction network, 
in which the nodes are proteins and the edges express how strongly pairs of pro- 
teins interact, an electrical circuit where the nodes are components and the edges 
are conductors, or a social network where the nodes are people and the edges are 
‘friendships’. More complex graphical structures are also possible, for example the 
knowledge graph inside a company comprises multiple different kinds of nodes such 
as people, documents, and meetings, along with multiple kinds of edges capturing 
different properties such as a person being present at a meeting or a document refer- 
encing another document. 

In this chapter we explore how to apply deep learning to graph-structured data. 
We have already encountered an example of structured data when we discussed im- 
ages, in which the individual elements of an image data vector x correspond to pixels 
on a regular grid. An image is therefore a special instance of graph-structured data 
in which the nodes are the pixels and the edges describe which pixels are adjacent. 
Convolutional neural networks (CNNs) take this structure into account, incorporat- 
ing prior knowledge of the relative positions of the pixels, together with the equiv- 
ariance of properties such as segmentation and the invariance of properties such as 
classification. We will use CNNs for images as a source of inspiration to construct 
more general approaches to deep learning for graphical data known as graph neu- 
ral networks (Zhou et al., 2018; Wu et al., 2019; Hamilton, 2020; Veličković, 2023). 
We will see that a key consideration when applying deep learning to graph-structured 
data is to ensure either equivariance or invariance with respect to a reordering of the 
nodes in the graph.

13.1. Machine Learning on Graphs


There are many kinds of applications that we might wish to address using graph- 
structured data, and we can group these broadly according to whether the goal is 
to predict properties of nodes, of edges, or of the whole graph. An example of 
node prediction would be to classify documents according to their topic based on the 
hyperlinks and citations between the documents. 

Regarding edges we might, for example, know some of the interactions in a pro- 
tein network and would like to predict the presence of any additional ones. Such 
tasks are called edge prediction or graph completion tasks. There are also tasks 
where the edges are known in advance and the goal is to discover clusters or ‘com- 
munities’ within the graph. 

Finally, we may wish to predict properties that relate to the graph as a whole. For 
example, we might wish to predict whether a particular molecule is soluble in water. 
Here instead of being given a single graph we will have a data set of different graphs, 
which we can view as being drawn from some common distribution, in other words 
we assume that the graphs themselves are independent and identically distributed. 
Such tasks can be considered as graph regression or graph classification tasks. 

For the molecule solubility classification example, we might be given a labelled 
training set of molecules, along with a test set of new molecules whose solubility 
needs to be predicted. This is a standard example of an inductive task of the kind 
we have seen many times in previous chapters. However, some graph prediction 
examples are transductive in which we are given the structure of the entire graph 
along with labels for some of the nodes and the goal is to predict the labels of the 
remaining nodes. An example would be a large social network in which our goal is to 
classify each node as either a real person or an automated bot. Here a small number 
of nodes might be manually labelled, but it would be prohibitive to investigate every 
node individually in a large and ever-changing social network. During training, we 
therefore have access to the whole graph along with labels for a subset of the nodes, 
and we wish to predict the labels for the remaining nodes. This can be viewed as a 
form of semi-supervised learning. 

As well as solving prediction tasks directly, we can also use deep learning on 
graphs to discover useful internal representations that can subsequently facilitate a 
range of downstream tasks. This is known as graph representation learning. For 
example we could seek to build a foundation model for molecules by training a deep 
learning system on a large corpus of molecular structures. The goal is that once 
trained, such a foundation model can be fine-tuned to specific tasks by using a small, 
labelled data set. 

Graph neural networks define an embedding vector for each of the nodes, usually 
initialized with the observed node properties, which are then transformed through a 
series of learnable layers to create a learned representation. This is analogous to 
the way word embeddings, or tokens, are processed through a series of layers in the 
transformer to give a representation that better captures the meaning of the words 
in the context of the rest of the text. Graph neural networks can also use learned 
embeddings associated with the edges and with the graph as a whole.

Figure 13.2 An example of an adjacency matrix showing (a) an example of a graph with five nodes, (b) the
associated adjacency matrix for a particular choice of node order, and (c) the adjacency matrix corresponding to
a different choice for the node order.

13.1.1 Graph properties


In this chapter we will focus on simple graphs where there is at most one edge 
between any pair of nodes, where the edges are undirected, and where there are no 
self-edges that connect a node to itself. This suffices to introduce the key concepts 
of graph neural networks, and it also encompasses a wide range of practical applica- 
tions. These concepts can then be applied to more complex graphical structures. 

We begin by introducing some notation associated with graphs and by defining
some important properties. A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of a set of nodes or vertices,
denoted by $\mathcal{V}$, along with a set of edges or links, denoted by $\mathcal{E}$. We index the nodes
by $n = 1, \ldots, N$, and we write the edge from node $n$ to node $m$ as $(n, m)$. If two
nodes are linked by an edge they are called neighbours, and the set of all neighbours
of node $n$ is denoted by $\mathcal{N}(n)$.

In addition to the graph structure, we usually also have observed data associated
with the nodes. For each node $n$ we can represent the corresponding node variables
as a $D$-dimensional column vector $\mathbf{x}_n$, and we can group these into a data matrix $\mathbf{X}$
of dimensionality $N \times D$ in which row $n$ is given by $\mathbf{x}_n^{\mathrm{T}}$. There may also be data
variables associated with the edges in the graph, although to start with we will focus
just on node variables.


13.1.2 Adjacency matrix 


A convenient way to specify the edges in a graph is to use an adjacency matrix
denoted by $\mathbf{A}$. To define the adjacency matrix we first have to choose an ordering for
the nodes. If there are $N$ nodes in the graph, we can index them using $n = 1, \ldots, N$.
The adjacency matrix has dimensions $N \times N$ and contains a 1 in every location $n, m$
for which there is an edge going from node $n$ to node $m$, with all other entries being
0. For graphs with undirected edges, the adjacency matrix will be symmetric since
the presence of an edge from node $n$ to node $m$ implies that there is also an edge
from node $m$ to node $n$, and therefore $A_{mn} = A_{nm}$ for all $n$ and $m$. An example of
an adjacency matrix is shown in Figure 13.2.
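
As a minimal sketch of this construction, the following NumPy code builds the adjacency matrix of a simple undirected graph from an edge list. The five-node edge list is a hypothetical example, not the graph of Figure 13.2, and nodes are indexed from 0 rather than 1 to follow Python conventions.

```python
import numpy as np


def adjacency_matrix(num_nodes, edges):
    """Build the symmetric 0/1 adjacency matrix of a simple undirected graph."""
    A = np.zeros((num_nodes, num_nodes), dtype=int)
    for n, m in edges:
        if n == m:
            raise ValueError("self-edges are not allowed in a simple graph")
        A[n, m] = 1  # edge from n to m
        A[m, n] = 1  # undirected, so also from m to n
    return A


# Hypothetical edge list for a five-node graph (0-based node indices).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
A = adjacency_matrix(5, edges)

# The neighbourhood set N(n) can be read off from row n of A.
neighbours = {n: np.flatnonzero(A[n]).tolist() for n in range(5)}
```

Here `neighbours[2]` lists the nodes that share an edge with node 2, and the matrix is symmetric by construction.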

Since the adjacency matrix defines the structure of a graph, we could consider
using it directly as the input to a neural network. To do this we could ‘flatten’ the
matrix, for example by concatenating the columns into one long column vector.
However, a major problem with this approach is that the adjacency matrix depends on the
arbitrary choice of node ordering, as seen in Figure 13.2. Suppose for instance that
we want to predict the solubility of a molecule. This clearly should not depend on
the ordering assigned to the nodes when writing down an adjacency matrix. Because
the number of permutations increases factorially with the number of nodes, it is
impractical to try to learn permutation invariance by using large data sets or by data
augmentation. Instead, we should treat this invariance property as an inductive bias
when constructing a network architecture.


13.1.3 Permutation equivariance 


We can express node label permutation mathematically by introducing the concept
of a permutation matrix $\mathbf{P}$, which has the same size as the adjacency matrix
and which specifies a particular permutation of a node ordering. It contains a single
1 in each row and a single 1 in each column, with 0 in all the other elements, such
that a 1 in position $n, m$ indicates that node $n$ will be relabelled as node $m$ after
the permutation. Consider, for example, the permutation from
$(A, B, C, D, E) \to (C, E, A, D, B)$ corresponding to the two choices of node ordering
in Figure 13.2. The corresponding permutation matrix takes the form

$$\mathbf{P} = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}. \tag{13.1}$$


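We can define the permutation matrix more formally as follows. First we introduce
the standard unit vector $\mathbf{u}_n$, for $n = 1, \ldots, N$. This is a column vector
in which all elements are 0 except element $n$, which equals 1. In this notation the
identity matrix is given by

$$\mathbf{I} = \begin{pmatrix}
\mathbf{u}_1^{\mathrm{T}} \\
\mathbf{u}_2^{\mathrm{T}} \\
\vdots \\
\mathbf{u}_N^{\mathrm{T}}
\end{pmatrix}. \tag{13.2}$$

We can now introduce a permutation function $\pi(\cdot)$ that maps $n$ to $m = \pi(n)$. The
associated permutation matrix is given by

$$\mathbf{P} = \begin{pmatrix}
\mathbf{u}_{\pi(1)}^{\mathrm{T}} \\
\mathbf{u}_{\pi(2)}^{\mathrm{T}} \\
\vdots \\
\mathbf{u}_{\pi(N)}^{\mathrm{T}}
\end{pmatrix}. \tag{13.3}$$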

When we reorder the labelling on the nodes of a graph, the effect on the corresponding
node data matrix $\mathbf{X}$ is to permute the rows according to $\pi(\cdot)$, which can be
achieved by pre-multiplication by $\mathbf{P}$ to give

$$\widetilde{\mathbf{X}} = \mathbf{P}\mathbf{X}. \tag{13.4}$$

For the adjacency matrix, both the rows and the columns become permuted. Again
the rows can be permuted using pre-multiplication by $\mathbf{P}$ whereas the columns are
permuted using post-multiplication by $\mathbf{P}^{\mathrm{T}}$, giving a new adjacency matrix:

$$\widetilde{\mathbf{A}} = \mathbf{P}\mathbf{A}\mathbf{P}^{\mathrm{T}}. \tag{13.5}$$
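
The effect of a permutation matrix on the data matrix and the adjacency matrix can be checked numerically. The sketch below builds the matrix of (13.1) by selecting rows of the identity, as in (13.3), using 0-based indices; the node data and the random graph are hypothetical values used only to exercise the formulas.

```python
import numpy as np


def permutation_matrix(pi):
    """Row n of P is the unit vector u_{pi(n)}^T, as in (13.3), 0-based."""
    return np.eye(len(pi), dtype=int)[pi]


# Zero-based version of the reordering (A, B, C, D, E) -> (C, E, A, D, B).
pi = [2, 4, 0, 3, 1]
P = permutation_matrix(pi)

# Hypothetical node data (row n holds x_n^T) and a random simple graph.
X = np.arange(10).reshape(5, 2)
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(0, 2, size=(5, 5)), k=1)
A = upper + upper.T  # symmetric with zero diagonal

X_tilde = P @ X        # permutes the rows of X
A_tilde = P @ A @ P.T  # permutes both rows and columns of A
```

Pre-multiplying by `P` is equivalent to indexing the rows with `pi`, and the two-sided product for the adjacency matrix is equivalent to `A[np.ix_(pi, pi)]`.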


When applying deep learning to graph-structured data, we will need to repre- 
sent the graph structure in numerical form so that it can be fed into a neural network, 
which requires that we assign an ordering to the nodes. However, the specific or- 
dering we choose is arbitrary and so it will be important to ensure that any global 
property of the graph does not depend on this ordering. In other words, the network 
predictions must be invariant to node label reordering, so that 


$$y(\widetilde{\mathbf{X}}, \widetilde{\mathbf{A}}) = y(\mathbf{X}, \mathbf{A}) \qquad \text{Invariance} \tag{13.6}$$

where $y(\cdot, \cdot)$ is the output of the network.

We may also want to make predictions that relate to individual nodes. In this 
case, if we reorder the node labelling then the corresponding predictions should show 
the same reordering so that a given prediction is always associated with the same 
node irrespective of the choice of order. In other words, node predictions should be 
equivariant with respect to node label reordering. This can be expressed as 


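$$\mathbf{y}(\widetilde{\mathbf{X}}, \widetilde{\mathbf{A}}) = \mathbf{P}\,\mathbf{y}(\mathbf{X}, \mathbf{A}) \qquad \text{Equivariance} \tag{13.7}$$

where $\mathbf{y}(\cdot, \cdot)$ is a vector of network outputs, with one element per node.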
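
The two properties can be illustrated with very simple functions standing in for a network. In the sketch below, which uses hypothetical toy functions rather than any real model, the edge count is an invariant graph-level output, the vector of node degrees is an equivariant node-level output, and the flattened adjacency matrix is neither.

```python
import numpy as np


def permutation_matrix(pi):
    """Row n of P is the unit vector for index pi[n] (0-based)."""
    return np.eye(len(pi), dtype=int)[pi]


def edge_count(X, A):
    """Graph-level output: number of undirected edges."""
    return A.sum() // 2


def node_degrees(X, A):
    """Node-level output: one value per node."""
    return A.sum(axis=1)


def flattened_adjacency(X, A):
    """Naive representation that depends on the node ordering."""
    return A.flatten()


# Path graph 0 - 1 - 2 - 3 with hypothetical node data.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
X = np.arange(8).reshape(4, 2)

P = permutation_matrix([2, 0, 3, 1])
X_t, A_t = P @ X, P @ A @ P.T

invariant_ok = edge_count(X_t, A_t) == edge_count(X, A)
equivariant_ok = np.array_equal(node_degrees(X_t, A_t), P @ node_degrees(X, A))
flatten_changes = not np.array_equal(flattened_adjacency(X_t, A_t),
                                     flattened_adjacency(X, A))
```

The last check shows why feeding a flattened adjacency matrix to a standard network is problematic: the same graph produces different inputs under different node orderings.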


13.2. Neural Message-Passing


Ensuring invariance or equivariance under node label permutations is a key design 
consideration when we apply deep neural networks to graph-structured data. An- 
other consideration is that we want to exploit the representational capabilities of deep 
neural networks and so we retain the concept of a ‘layer’ as a computational trans- 
formation that can be applied repeatedly. If each layer of the network is equivariant 
under node reordering then multiple layers applied in succession will also exhibit 
equivariance, while allowing each layer of the network to be informed by the graph 
structure. 

For networks whose outputs represent node-level predictions, the whole network 
will be equivariant as required. If the network is being used to predict a graph- 
level property then a final layer can be included that is invariant to permutations 
of its inputs. We also want to ensure that each layer is a highly flexible nonlinear 
function and is differentiable with respect to its parameters so that it can be trained 
by stochastic gradient descent using gradients obtained by automatic differentiation. 

Graphs come in various sizes. For example different molecules can have different
numbers of atoms, so a fixed-length representation as used for standard neural
networks is unsuitable. A further requirement is therefore that the network should be
able to handle variable-length inputs, as we saw with transformer networks. Some
graphs can be very large, for example a social network with many millions of
participants, and so we also want to construct models that scale well. Not surprisingly,
parameter sharing will play an important role, both to allow the invariance and
equivariance properties to be built into the network architecture and to facilitate
scaling to large graphs.

Figure 13.3 A convolutional filter for images can be represented as a graph-structured computation. (a) A filter
computed by node $i$ in layer $l + 1$ of a deep convolutional network is a function of the activation values in layer
$l$ over a local patch of pixels. (b) The same computation structure expressed as a graph showing ‘messages’
flowing into node $i$ from its neighbours.

Section 10.2


13.2.1 Convolutional filters 


To develop a framework that meets all of these requirements, we can seek inspi- 
ration from image processing using convolutional neural networks. First note that 
an image can be viewed as a specific instance of graph-structured data, in which the 
nodes are the pixels and the edges represent pairs of pixels that are adjacent in the 
image, where adjacency includes nodes that are diagonally adjacent as well as those 
that are horizontally or vertically adjacent. 

In a convolutional network, we make successive transformations of the image
domain such that a pixel at a particular layer computes a function of states of pixels
in the previous layer through a local function called a filter. Consider a convolutional
layer using $3 \times 3$ filters, as illustrated in Figure 13.3(a). The computation performed
by a single filter at a single pixel in layer $l + 1$ can be expressed as

$$z_i^{(l+1)} = f\left( \sum_{j} w_j z_j^{(l)} + b \right) \tag{13.8}$$

where $f(\cdot)$ is a differentiable nonlinear activation function such as ReLU, and the
sum over $j$ is taken over all nine pixels in a small patch in layer $l$. The same function
is applied across multiple patches in the image, so that the weights $w_j$ and bias $b$ are
shared across the patches (and therefore do not carry the index $i$).

As it stands, (13.8) is not equivariant under reordering of the nodes in layer $l$
because the weight vector, with elements $w_j$, is not invariant under permutation of
its elements. However, we can achieve equivariance with some simple modifications
as follows. We first view the filter as a graph, as shown in Figure 13.3(b), and
separate out the contribution from node $i$. The other eight nodes are its neighbours
$\mathcal{N}(i)$. We then assume that a single weight parameter $w_{\mathrm{neigh}}$ is shared across the
neighbours so that

$$z_i^{(l+1)} = f\left( w_{\mathrm{neigh}} \sum_{j \in \mathcal{N}(i)} z_j^{(l)} + w_{\mathrm{self}}\, z_i^{(l)} + b \right) \tag{13.9}$$

where node $i$ has its own weight parameter $w_{\mathrm{self}}$.

We can interpret (13.9) as updating a local representation $z_i$ at node $i$ by gathering
information from the neighbouring nodes by passing messages from the neighbouring
nodes into node $i$. In this case the messages are simply the activations of
the other nodes. These messages are then combined with information from node $i$,
and the result is transformed using a nonlinear function. The information from the
neighbouring nodes is aggregated through a simple summation in (13.9), and this is
clearly invariant to any permutation of the labels associated with those nodes.
Furthermore, the operation (13.9) is applied synchronously to every node in a graph, and
so if the nodes are permuted then the resulting computations will be unchanged but
their ordering will be likewise permuted, and hence, this calculation is equivariant
under node reordering. Note that this depends on the parameters $w_{\mathrm{neigh}}$, $w_{\mathrm{self}}$, and $b$
being shared across all nodes.
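
A minimal NumPy sketch of (13.9) applied to every node of a graph at once is given below, with $f$ taken to be ReLU. The three-node path graph, the activation values, and the parameter values are hypothetical choices for illustration; the final lines check the equivariance property numerically for one permutation.

```python
import numpy as np


def shared_weight_layer(z, A, w_neigh, w_self, b):
    """Apply (13.9) to all nodes: A @ z sums the activations of each node's neighbours."""
    pre_activation = w_neigh * (A @ z) + w_self * z + b
    return np.maximum(pre_activation, 0.0)  # ReLU


# Path graph 0 - 1 - 2 with one scalar activation per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
z = np.array([1.0, 2.0, 3.0])
z_next = shared_weight_layer(z, A, w_neigh=0.5, w_self=1.0, b=-1.0)

# Relabel the nodes and repeat the computation on the permuted inputs.
P = np.eye(3, dtype=int)[[2, 0, 1]]
z_next_permuted = shared_weight_layer(P @ z, P @ A @ P.T, 0.5, 1.0, -1.0)
```

Because the three parameters are shared by all nodes, `z_next_permuted` is simply `P @ z_next`.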


13.2.2 Graph convolutional networks 


We now use the convolution example as a template to construct deep neural
networks for graph-structured data. Our goal is to define a flexible, nonlinear
transformation of the node embeddings that is differentiable with respect to a set of weight
and bias parameters and which maps the variables in layer $l$ into corresponding
variables in layer $l + 1$. For each node $n$ in the graph and for each layer $l$ in the
network, we introduce a $D$-dimensional column vector $\mathbf{h}_n^{(l)}$ of node-embedding
variables, where $n = 1, \ldots, N$ and $l = 1, \ldots, L$.

We see that the transformation given by (13.9) first gathers and combines
information from neighbouring nodes and then updates the node as a function of the
current embedding of the node and the incoming messages. We can therefore view
each layer of processing as having two successive stages. The first is the aggregation
stage in which, for each node $n$, messages are passed to that node from its
neighbours and combined to form a new vector $\mathbf{z}_n^{(l)}$ in a way that is permutation invariant.
This is followed by an update step in which the aggregated information from
neighbouring nodes is combined with local information from the node itself and used to
calculate a revised embedding vector for that node.

Algorithm 13.1: Simple message-passing neural network

    Input: Undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
           Initial node embeddings $\{\mathbf{h}_n^{(0)} = \mathbf{x}_n\}$
           Aggregate($\cdot$) function
           Update($\cdot, \cdot$) function
    Output: Final node embeddings $\{\mathbf{h}_n^{(L)}\}$

    // Iterative message-passing
    for $l \in \{0, \ldots, L - 1\}$ do
        $\mathbf{z}_n^{(l)} \leftarrow$ Aggregate$\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right)$
        $\mathbf{h}_n^{(l+1)} \leftarrow$ Update$\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l)}\right)$
    end for
    return $\{\mathbf{h}_n^{(L)}\}$

Consider a specific node $n$ in the graph. We first aggregate the node vectors
from all the neighbours of node $n$:

$$\mathbf{z}_n^{(l)} = \mathrm{Aggregate}\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right). \tag{13.10}$$

The form of this aggregation function is very flexible, provided that it is well defined
for a variable number of neighbouring nodes and does not depend on the ordering of those
nodes. It can potentially contain learnable parameters as long as it is a differentiable
function with respect to those parameters to facilitate gradient descent training.

We then use another operation to update the embedding vector at node $n$:

$$\mathbf{h}_n^{(l+1)} = \mathrm{Update}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l)}\right). \tag{13.11}$$

Again, this can be a differentiable function of a set of learnable parameters.
Application of the Aggregate operation followed by the Update operation in parallel for
every node in the graph represents one layer of the network. The node embeddings
are typically initialized using observed node data so that $\mathbf{h}_n^{(0)} = \mathbf{x}_n$. Note that each
layer generally has its own independent parameters, although the parameters can also
be shared across layers. This framework is called a message-passing neural network
(Gilmer et al., 2017) and is summarized in Algorithm 13.1.
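
The message-passing loop can be sketched directly in NumPy. This is only an illustrative instance of the framework: the summation aggregator, the tanh update with matrices `W_self` and `W_neigh`, the random parameter values, and the sharing of one parameter set across all layers are assumptions made here, not choices prescribed by the text.

```python
import numpy as np


def message_passing(H0, neighbours, aggregate, update, num_layers):
    """Run num_layers rounds of Aggregate followed by Update on every node."""
    H = np.asarray(H0, dtype=float)
    num_nodes = H.shape[0]
    for _ in range(num_layers):
        # Aggregation stage: all messages are computed from the current layer.
        Z = [aggregate(H[neighbours[n]]) for n in range(num_nodes)]
        # Update stage: every node is updated synchronously.
        H = np.stack([update(H[n], Z[n]) for n in range(num_nodes)])
    return H


def sum_aggregate(H_neigh):
    """Permutation-invariant aggregation by summation over neighbours."""
    return H_neigh.sum(axis=0)


def make_update(W_self, W_neigh, b):
    """Hypothetical update: a single nonlinear layer with shared parameters."""
    def update(h, z):
        return np.tanh(W_self @ h + W_neigh @ z + b)
    return update


# Path graph 0 - 1 - 2 with two-dimensional node data.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

rng = np.random.default_rng(0)
update = make_update(rng.normal(size=(2, 2)), rng.normal(size=(2, 2)), np.zeros(2))
H_final = message_passing(X, neighbours, sum_aggregate, update, num_layers=2)
```

Any other pair of functions with the same signatures can be passed in place of `sum_aggregate` and `update`, provided the aggregator does not depend on the order of its input rows.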
13.2.3 Aggregation operators


There are many possible forms for the Aggregate function, but it must depend 
only on the set of inputs and not on their ordering. It must also be a differentiable 
function of any learnable parameters. The simplest such aggregation function, fol- 
lowing from (13.9), is summation: 


$$\mathrm{Aggregate}\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right) = \sum_{m \in \mathcal{N}(n)} \mathbf{h}_m^{(l)}. \tag{13.12}$$


A simple summation is clearly independent of the ordering of the neighbouring nodes 
and is also well defined no matter how many nodes are in the neighbourhood set. 
Note that this has no learnable parameters. 

A summation gives a stronger influence over nodes that have many neighbours 
compared to those with few neighbours, and this can lead to numerical issues, par- 
ticularly in applications such as social networks where the size of the neighbourhood 
set can vary by several orders of magnitude. A variation of this approach is to define 
the Aggregation operation to be the average of the neighbouring embedding vectors 
so that 


$$\mathrm{Aggregate}\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right) = \frac{1}{|\mathcal{N}(n)|} \sum_{m \in \mathcal{N}(n)} \mathbf{h}_m^{(l)} \tag{13.13}$$

where $|\mathcal{N}(n)|$ denotes the number of nodes in the neighbourhood set $\mathcal{N}(n)$.
However, this normalization also discards information about the network structure and is
provably less powerful than a simple summation (Hamilton, 2020), and so the choice
of whether to use it depends on the relative importance of node features compared to
graph structure.

Another variation of this approach (Kipf and Welling, 2016) takes account of
the number of neighbours for each of the neighbouring nodes:

$$\mathrm{Aggregate}\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right) = \sum_{m \in \mathcal{N}(n)} \frac{\mathbf{h}_m^{(l)}}{\sqrt{|\mathcal{N}(n)|\,|\mathcal{N}(m)|}}. \tag{13.14}$$

Yet another possibility is to take the element-wise maximum (or minimum) of the
neighbouring embedding vectors, which also satisfies the desired properties of being
well defined for a variable number of neighbours and of being independent of their
order.
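
The parameter-free aggregation operators described above can be compared side by side. The following sketch evaluates them for every node of a graph using its adjacency matrix; it assumes that every node has at least one neighbour, since the mean and the symmetric normalization would otherwise divide by zero. The star graph and the embedding values are hypothetical.

```python
import numpy as np


def aggregate_sum(H, A):
    """Summation over neighbours, as in (13.12)."""
    return A @ H


def aggregate_mean(H, A):
    """Average over neighbours, as in (13.13)."""
    degree = A.sum(axis=1, keepdims=True)
    return (A @ H) / degree


def aggregate_sym_norm(H, A):
    """Symmetric normalization by both neighbourhood sizes, as in (13.14)."""
    degree = A.sum(axis=1)
    return (A / np.sqrt(np.outer(degree, degree))) @ H


def aggregate_max(H, A):
    """Element-wise maximum over the neighbours of each node."""
    return np.stack([H[A[n] > 0].max(axis=0) for n in range(A.shape[0])])


# Star graph: node 0 is connected to nodes 1, 2 and 3.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
H = np.array([[1.0], [2.0], [3.0], [4.0]])  # one-dimensional embeddings

z_sum = aggregate_sum(H, A)
z_mean = aggregate_mean(H, A)
z_sym = aggregate_sym_norm(H, A)
z_max = aggregate_max(H, A)
```

For the central node the sum grows with the number of neighbours whereas the mean does not, which illustrates the trade-off between numerical scale and structural information discussed above.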

Since each node in a given layer of the network is updated by aggregating infor- 
mation from its neighbours in the previous layer, this defines a receptive field anal- 
ogous to the receptive fields of filters used in CNNs. As information is processed 
through successive layers, the updates to a given node depend on a steadily increas- 
ing fraction of other nodes in earlier layers until the effective receptive field poten- 
tially spans the whole graph as illustrated in Figure 13.4. However, large, sparse 
graphs may require an excessive number of layers before each output is influenced 
by every input. Some architectures therefore introduce an additional ‘super-node’
that connects directly to every node in the original graph to ensure fast propagation
of information.

Figure 13.4 Schematic illustration of information flow through successive layers of a
graph neural network. In the third layer a single node is highlighted in red. It receives
information from its two neighbours in the previous layer and those in turn receive
information from their neighbours in the first layer. As with convolutional neural
networks for images, we see that the effective receptive field, corresponding to the
number of nodes shown in red, grows with the number of processing layers.

The aggregation operators discussed so far have no learnable parameters. We
can introduce such parameters if we first transform each of the embedding vectors
from neighbouring nodes using a multilayer neural network, denoted by MLP_φ,
before combining their outputs, where MLP denotes ‘multilayer perceptron’ and φ
represents the parameters of the network. So long as the network has a structure and
parameter values that are shared across nodes, this aggregation operator will again
be permutation invariant. We can also transform the combined vector with another
neural network MLP_θ, with parameters θ, to give an overall aggregation operator:


\mathrm{Aggregate}\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right) = \mathrm{MLP}_\theta\left( \sum_{m \in \mathcal{N}(n)} \mathrm{MLP}_\phi\left(\mathbf{h}_m^{(l)}\right) \right) \qquad (13.15)


in which MLP_φ and MLP_θ are shared across layer l. Due to the flexibility of MLPs,
the transformation defined by (13.15) represents a universal approximator for any 
permutation-invariant function that maps a set of embeddings to a single embedding 
(Zaheer et al., 2017). Note that the summation can be replaced by other invariant 
functions such as averages or an element-wise maximum or minimum. 
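
As a rough illustration of (13.15), the sketch below builds the two networks as one-hidden-layer MLPs with random, untrained weights and checks numerically that the result does not depend on the order of the neighbours. The helper names `make_mlp` and `mlp`, the layer widths, and the ReLU hidden layer are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(d_in, d_hidden, d_out, rng):
    """Random parameters of a one-hidden-layer MLP (hypothetical helper)."""
    return (rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)), np.zeros(d_out))

def mlp(params, x):
    W1, b1, W2, b2 = params
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2    # ReLU hidden layer

phi = make_mlp(4, 8, 8, rng)      # applied to each neighbour separately
theta = make_mlp(8, 8, 4, rng)    # applied to the pooled vector

def aggregate(neighbours):
    """neighbours: (K, 4) array holding the K neighbour embeddings."""
    pooled = mlp(phi, neighbours).sum(axis=0)        # order-independent sum
    return mlp(theta, pooled)

h = rng.normal(size=(5, 4))
z = aggregate(h)
z_shuffled = aggregate(h[rng.permutation(5)])        # same set, new order
```

Because the same `phi` is applied to every neighbour before the sum, `z` and `z_shuffled` agree up to floating-point rounding, and the function accepts any number of neighbours `K`.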

A special case of graph neural networks arises if we consider a graph having no
edges, which corresponds simply to an unstructured set of nodes. In this case if we
use (13.15) for each vector h_n^{(l)} in the set, in which the summation is taken over all
vectors in the set except h_n^{(l)}, then we have a general framework for learning functions
over unstructured sets of variables known as deep sets.

Exercise 13.7


13.2.4 Update operators 


Having chosen a suitable Aggregate operator, we similarly need to decide on the 
form of the Update operator. By analogy with (13.9) for the CNN, a simple form for 
this operator would be 


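\mathrm{Update}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l)}\right) = f\left(\mathbf{W}_{\text{self}} \mathbf{h}_n^{(l)} + \mathbf{W}_{\text{neigh}} \mathbf{z}_n^{(l)} + \mathbf{b}\right) \qquad (13.16)


where f(·) is a nonlinear activation function such as ReLU applied element-wise to
its vector argument, and where W_self, W_neigh, and b are the learnable weights and
biases and z_n^{(l)} is defined by the Aggregate operator (13.10).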

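If we choose a simple summation (13.12) as the aggregation function and if
we also share the same weight matrix between nodes and their neighbours so that
W_self = W_neigh, we obtain a particularly simple form of Update operator given by


\mathbf{h}_n^{(l+1)} = \mathrm{Update}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l)}\right) = f\left(\mathbf{W}_{\text{neigh}} \sum_{m \in \mathcal{N}(n),\, n} \mathbf{h}_m^{(l)} + \mathbf{b}\right). \qquad (13.17)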


The message-passing algorithm is typically initialized by setting h_n^{(0)} = x_n.
Sometimes, however, we may want to have an internal representation vector for each
node that has a higher, or lower, dimensionality than that of x_n. Such a representation
can be initialized by padding the node vectors x_n with additional zeros (to
achieve a higher dimensionality) or simply by transforming the node vectors using a 
learnable linear transformation to a space of the desired number of dimensions. An 
alternative form of initialization, particularly when there are no data variables asso- 
ciated with the nodes, is to use a one-hot vector that labels the degree of each node 
(i.e., the number of neighbours). 

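Overall, we can represent a graph neural network as a sequence of layers that
successively transform the node embeddings. If we group these embeddings into a
matrix H whose nth row is the vector h_n^T, which is initialized to the data matrix X,
then we can write the successive transformations in the form


\mathbf{H}^{(1)} = \mathbf{F}\left(\mathbf{X}, \mathbf{A}, \mathbf{W}^{(1)}\right)
\mathbf{H}^{(2)} = \mathbf{F}\left(\mathbf{H}^{(1)}, \mathbf{A}, \mathbf{W}^{(2)}\right)
\vdots
\mathbf{H}^{(L)} = \mathbf{F}\left(\mathbf{H}^{(L-1)}, \mathbf{A}, \mathbf{W}^{(L)}\right) \qquad (13.18)


where A is the adjacency matrix, and W^{(l)} represents the complete set of weights
and biases in layer l of the network. Under a node reordering defined by a permutation
matrix P, the transformation of the node embeddings computed by layer l is
equivariant:

\mathbf{P}\mathbf{H}^{(l)} = \mathbf{F}\left(\mathbf{P}\mathbf{H}^{(l-1)}, \mathbf{P}\mathbf{A}\mathbf{P}^{\mathrm{T}}, \mathbf{W}^{(l)}\right). \qquad (13.19)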


As a consequence, the complete network computes an equivariant transformation. 
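
The equivariance property (13.19) can be checked numerically for a single layer. The sketch below assumes the matrix form of the update (13.16) with summation as the aggregation, in which embeddings are stored as rows so that the weight matrices multiply from the right; the function name `gnn_layer` and the random test graph are illustrative.

```python
import numpy as np

def gnn_layer(H, A, W_self, W_neigh, b):
    """One layer in matrix form: f(A H W_neigh + H W_self + 1 b^T), f = ReLU."""
    return np.maximum(A @ H @ W_neigh + H @ W_self + b, 0.0)

rng = np.random.default_rng(1)
N, D = 5, 3
A = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
A = A + A.T                                    # symmetric, zero diagonal
X = rng.normal(size=(N, D))
W_self, W_neigh = rng.normal(size=(2, D, D))
b = rng.normal(size=D)

P = np.eye(N)[rng.permutation(N)]              # random permutation matrix
lhs = P @ gnn_layer(X, A, W_self, W_neigh, b)           # permute the outputs
rhs = gnn_layer(P @ X, P @ A @ P.T, W_self, W_neigh, b) # permute the inputs
```

The two results `lhs` and `rhs` coincide, and stacking several such layers preserves the property because each layer receives an already permuted input together with the same permuted adjacency matrix.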
13.2.5 Node classification


A graph neural network can be viewed as a series of layers each of which transforms
a set of node-embedding vectors {h_n^{(l)}} into a new set {h_n^{(l+1)}} of the same
size and dimensionality. After the final convolutional layer of the network, we need
to obtain predictions so that we can define a cost function for training and also for 
making predictions on new data using the trained network. 

Consider first the task of classifying the nodes in a graph, which is one of the 
most common uses for graph neural networks. We can define an output layer, some- 
times called a readout layer, which calculates a softmax function for each node cor- 
responding to a classification over C classes, of the form 


y_{ni} = \frac{\exp\left(\mathbf{w}_i^{\mathrm{T}} \mathbf{h}_n^{(L)}\right)}{\sum_{j} \exp\left(\mathbf{w}_j^{\mathrm{T}} \mathbf{h}_n^{(L)}\right)} \qquad (13.20)


where {w_i} is a set of learnable weight vectors and i = 1, ..., C. We can then
define a loss function as the sum of the cross-entropy loss across all nodes and all
classes:


\mathcal{L} = -\sum_{n \in \mathcal{V}_{\text{train}}} \sum_{i=1}^{C} t_{ni} \ln y_{ni} \qquad (13.21)


where {t_{ni}} are target values with a one-hot encoding for each value of n. Because
the weight vectors {w_i} are shared across the output nodes, the outputs y_{ni} are
equivariant to permutation of the node ordering, and hence the loss function (13.21)
is invariant. If the goal is to predict continuous values at the outputs then a simple
linear transformation can be combined with a sum-of-squares error to define a
suitable loss function.

The sum over n in (13.21) is taken over the subset of the nodes denoted by V_train
and used for training. We can distinguish between three types of nodes as follows:


1. The nodes V_train are labelled and included in the message-passing operations
of the graph neural network and are also used to compute the loss function
used for training.


2. There is potentially also a transductive subset of nodes denoted by V_trans,
which are unlabelled and which do not contribute to the evaluation of the
loss function used for training. However, they still participate in the
message-passing operations during both training and inference, and their labels may be
predicted as part of the inference process.


3. The remaining nodes, denoted V_induct, are a set of inductive nodes that are not
used to compute the loss function, and neither these nodes nor their associated
edges participate in message-passing during the training phase. However, they
do participate in message-passing during the inference phase and their labels
are predicted as the outcome of inference.


If there are no transductive nodes, and hence the test nodes (and their associated 
edges) are not available during the training phase, then the training is generally 
referred to as inductive learning, which can be considered to be a form of super- 
vised learning. However, if there are transductive nodes then it is called transductive 
learning, which may be viewed as a form of semi-supervised learning. 
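
The readout (13.20) and the training loss (13.21) can be sketched as follows, with a Boolean mask standing in for the set V_train. The function names, the random embeddings, and the particular mask are assumptions made for illustration; in a real model the embeddings would come from the final message-passing layer.

```python
import numpy as np

def node_softmax(H, W):
    """Per-node class probabilities y_ni from final embeddings H (N, D)
    and class weight vectors stacked as the rows of W (C, D)."""
    logits = H @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def masked_cross_entropy(Y, T, train_mask):
    """Sum of -t_ni ln y_ni over the nodes selected by train_mask only."""
    return -np.sum(T[train_mask] * np.log(Y[train_mask]))

rng = np.random.default_rng(2)
H = rng.normal(size=(6, 4))             # stand-in final-layer embeddings
W = rng.normal(size=(3, 4))             # C = 3 classes
T = np.eye(3)[[0, 2, 1, 1, 0, 2]]       # one-hot targets
train_mask = np.array([True, True, True, False, False, False])

Y = node_softmax(H, W)
loss = masked_cross_entropy(Y, T, train_mask)
```

Nodes outside the mask still receive predictions in `Y`, which mirrors the transductive setting: they take part in the forward computation but contribute nothing to `loss`.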


13.2.6 Edge classification 


In some applications we wish to make predictions about the edges of the graph 
rather than the nodes. A common form of edge classification task is edge completion 
in which the goal is to determine whether an edge should be present between two 
nodes. Given a set of node embeddings, the dot product between pairs of embeddings 
can be used to define a probability p(n, m) for the presence of an edge between nodes 
n and m by using the logistic sigmoid function: 


p(n, m) = \sigma\left(\mathbf{h}_n^{\mathrm{T}} \mathbf{h}_m\right). \qquad (13.22)


An example application would be predicting whether two people in a social network 
have shared interests and therefore might wish to connect. 
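
A minimal sketch of (13.22), assuming the final node embeddings are stored as the rows of a matrix `H`, computes the edge probability for every pair of nodes at once; the function name and the toy embeddings are illustrative.

```python
import numpy as np

def edge_probabilities(H):
    """Matrix of p(n, m) = sigmoid(h_n^T h_m) for all node pairs."""
    return 1.0 / (1.0 + np.exp(-(H @ H.T)))

# Nodes 0 and 1 have aligned embeddings, node 2 points the opposite way.
H = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]])
P_edge = edge_probabilities(H)
```

The resulting matrix is symmetric because the dot product is symmetric, so this scoring function cannot by itself distinguish the two directions of a directed edge.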


13.2.7 Graph classification 


In some applications of graph neural networks, the goal is to predict the properties
of new graphs given a training set of labelled graphs G_1, ..., G_N. This requires
that we combine the final-layer embedding vectors in a way that does not depend 
on the arbitrary node ordering, thereby ensuring that the output predictions will be 
invariant to that ordering. The goal is somewhat like that of the Aggregate function 
except that all nodes in the graph are included, not just the neighbourhood sets of the 
individual nodes. The simplest approach is to take the sum of the node-embedding
vectors:


\mathbf{y} = \mathbf{f}\left(\sum_{n \in \mathcal{V}} \mathbf{h}_n^{(L)}\right) \qquad (13.23)


where the function f may contain learnable parameters such as a linear transforma- 
tion or a neural network. Other invariant aggregation functions can be used such as 
averages or element-wise minimum or maximum. 

A cross-entropy loss is typically used for classification problems, such as la- 
belling a candidate drug molecule as toxic or safe, and a squared-error loss for re- 
gression problems, such as predicting the solubility of a candidate drug molecule. 
Graph-level predictions correspond to an inductive task since there must be separate 
sets of graphs for training and for inference. 


13.3. General Graph Networks


There are many variations and extensions of the graph networks considered so far. 
Here we outline a few of the key concepts along with some practical considerations. 


Section 12.1 


Exercise 13.8 
13.3.1 Graph attention networks


The attention mechanism is very powerful when used as the basis of a trans- 
former architecture. It can be used in the context of graph neural networks to con- 
struct an aggregation function that combines messages from neighbouring nodes. 
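The incoming messages are weighted by attention coefficients A_{nm} to give


\mathbf{z}_n^{(l)} = \mathrm{Aggregate}\left(\{\mathbf{h}_m^{(l)} : m \in \mathcal{N}(n)\}\right) = \sum_{m \in \mathcal{N}(n)} A_{nm} \mathbf{h}_m^{(l)} \qquad (13.24)


where the attention coefficients satisfy


A_{nm} \geqslant 0 \qquad (13.25)

\sum_{m \in \mathcal{N}(n)} A_{nm} = 1. \qquad (13.26)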


This is known as a graph attention network (Veličković et al., 2017) and can capture 
an inductive bias that says some neighbouring nodes will be more important than 
others in determining the best update in a way that depends on the data itself. 

There are multiple ways to construct the attention coefficients, and these gener- 
ally employ a softmax function. For example, we can use a bilinear form: 


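A_{nm} = \frac{\exp\left(\mathbf{h}_n^{\mathrm{T}} \mathbf{W} \mathbf{h}_m\right)}{\sum_{m' \in \mathcal{N}(n)} \exp\left(\mathbf{h}_n^{\mathrm{T}} \mathbf{W} \mathbf{h}_{m'}\right)} \qquad (13.27)


where W is a D × D matrix of learnable parameters. A more general option is to
use a neural network to combine the embedding vectors from the nodes at each end
of the edge:


A_{nm} = \frac{\exp\left\{\mathrm{MLP}\left(\mathbf{h}_n, \mathbf{h}_m\right)\right\}}{\sum_{m' \in \mathcal{N}(n)} \exp\left\{\mathrm{MLP}\left(\mathbf{h}_n, \mathbf{h}_{m'}\right)\right\}} \qquad (13.28)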


where the MLP has a single continuous output variable whose value is invariant if 
the input vectors are exchanged. Provided the MLP is shared across all the nodes in 
the network, this aggregation function will be equivariant under node reordering. 
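
The attention-weighted aggregation (13.24) with bilinear softmax coefficients can be sketched as below. The code assumes a 0/1 adjacency matrix in which every node has at least one neighbour, and it uses the name `weights` for the attention coefficients to avoid a clash with the adjacency matrix `A`; both choices, and the ring graph used as input, are illustrative.

```python
import numpy as np

def attention_aggregate(H, A, W):
    """z_n = sum_m weights[n, m] h_m, where weights[n] is a softmax over the
    neighbours of n of the bilinear score h_n^T W h_m."""
    scores = H @ W @ H.T
    scores = np.where(A > 0, scores, -np.inf)          # mask non-neighbours
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                           # exp(-inf) gives 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ H, weights

rng = np.random.default_rng(5)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])                           # ring of four nodes
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))
Z, weights = attention_aggregate(H, A, W)
```

Each row of `weights` is non-negative, sums to one, and is zero outside the neighbourhood set, so the constraints (13.25) and (13.26) hold by construction.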

A graph attention network can be extended by introducing multiple attention
heads in which H distinct sets of attention weights A_{nm}^{(h)} are defined, for h =
1, ..., H, in which each head is evaluated using one of the mechanisms described
above and with its own independent parameters. These are then combined in the
aggregation step using concatenation and linear projection. Note that, for a fully
connected network, a multi-head graph attention network becomes a standard
transformer encoder.


13.3.2 Edge embeddings 


The graph neural networks discussed above use embedding vectors that are
associated with the nodes. We have seen that some networks also have data associated
with the edges. Even when there are no observable values associated with the edges,
we can still maintain and update edge-based hidden variables and these can
contribute to the internal representations learned by the graph neural network.


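In addition to the node embeddings given by h_n^{(l)}, we therefore introduce edge
embeddings e_{nm}^{(l)}. We can then define general message-passing equations in the form


\mathbf{e}_{nm}^{(l+1)} = \mathrm{Update}_{\text{edge}}\left(\mathbf{e}_{nm}^{(l)}, \mathbf{h}_n^{(l)}, \mathbf{h}_m^{(l)}\right) \qquad (13.29)
\mathbf{z}_n^{(l+1)} = \mathrm{Aggregate}_{\text{node}}\left(\{\mathbf{e}_{nm}^{(l+1)} : m \in \mathcal{N}(n)\}\right) \qquad (13.30)
\mathbf{h}_n^{(l+1)} = \mathrm{Update}_{\text{node}}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l+1)}\right). \qquad (13.31)


The learned edge embeddings e_{nm}^{(L)} from the final layer can be used directly to make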


13.3.3 Graph embeddings 


In addition to node and edge embeddings we can also maintain and update an
embedding vector g^{(l)} that relates to the graph as a whole. Bringing all these aspects
together allows us to define a more general set of message-passing functions, and a
richer set of learned representations, for graph-structured applications. Specifically,
we can define general message-passing equations (Battaglia et al., 2018):


\mathbf{e}_{nm}^{(l+1)} = \mathrm{Update}_{\text{edge}}\left(\mathbf{e}_{nm}^{(l)}, \mathbf{h}_n^{(l)}, \mathbf{h}_m^{(l)}, \mathbf{g}^{(l)}\right) \qquad (13.32)
\mathbf{z}_n^{(l+1)} = \mathrm{Aggregate}_{\text{node}}\left(\{\mathbf{e}_{nm}^{(l+1)} : m \in \mathcal{N}(n)\}\right) \qquad (13.33)
\mathbf{h}_n^{(l+1)} = \mathrm{Update}_{\text{node}}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l+1)}, \mathbf{g}^{(l)}\right) \qquad (13.34)
\mathbf{g}^{(l+1)} = \mathrm{Update}_{\text{graph}}\left(\mathbf{g}^{(l)}, \{\mathbf{h}_n^{(l+1)} : n \in \mathcal{V}\}, \{\mathbf{e}_{nm}^{(l+1)} : (n, m) \in \mathcal{E}\}\right). \qquad (13.35)


These update equations start in (13.32) by updating the edge embedding vectors
e_{nm}^{(l+1)} based on the previous states of those vectors, on the node embeddings for the
nodes connected by each edge, and on a graph-level embedding vector g^{(l)}. These
updated edge embeddings are then aggregated across every edge connected to each
node using (13.33) to give a set of aggregated vectors. These in turn then contribute
to the update of the node-embedding vectors {h_n^{(l+1)}} based on the current
node-embedding vectors and on the graph-level embedding vector using (13.34). Finally,
the graph-level embedding vector is updated using (13.35) based on information 
from all the nodes and all the edges in the graph along with the graph-level em- 
bedding from the previous layer. These message-passing updates are illustrated in 
Figure 13.5 and are summarized in Algorithm 13.2. 
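
One pass of the updates (13.32) to (13.35) might be sketched as follows. The text leaves the Update and Aggregate functions general, so the code makes several simplifying assumptions: every update is a single linear map followed by tanh, aggregation is a sum, node, edge, and graph embeddings share one dimensionality D, and each undirected edge appears in the edge list in both directions. The name `graph_network_layer` and the parameter tuple are hypothetical.

```python
import numpy as np

def graph_network_layer(h, e, g, edges, params):
    """One pass of (13.32)-(13.35).

    h: (N, D) node embeddings, e: (E, D) edge embeddings, g: (D,) graph
    embedding, edges: (E, 2) integer array of (n, m) pairs.
    """
    W_e, W_h, W_g = params                       # shapes (4D, D), (3D, D), (3D, D)
    n, m = edges[:, 0], edges[:, 1]
    # (13.32): edge update from old edge, both end nodes, and graph state
    e_new = np.tanh(np.concatenate(
        [e, h[n], h[m], np.broadcast_to(g, e.shape)], axis=1) @ W_e)
    # (13.33): sum the updated edge embeddings attached to each node
    z = np.zeros_like(h)
    np.add.at(z, n, e_new)
    # (13.34): node update from old node state, aggregate, and graph state
    h_new = np.tanh(np.concatenate(
        [h, z, np.broadcast_to(g, h.shape)], axis=1) @ W_h)
    # (13.35): graph update from pooled updated nodes and edges
    g_new = np.tanh(np.concatenate(
        [g, h_new.sum(axis=0), e_new.sum(axis=0)]) @ W_g)
    return h_new, e_new, g_new

rng = np.random.default_rng(3)
D = 2
edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1]])   # path graph 0 - 1 - 2
h0 = rng.normal(size=(3, D))
e0 = rng.normal(size=(4, D))
g0 = np.zeros(D)
params = (rng.normal(size=(4 * D, D)),
          rng.normal(size=(3 * D, D)),
          rng.normal(size=(3 * D, D)))
h1, e1, g1 = graph_network_layer(h0, e0, g0, edges, params)
```

Calling the function L times in a loop, feeding each output back in as the next input, gives the iteration of Algorithm 13.2; separate parameter tuples per call would correspond to independent weights in each layer.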


13.3.4 Over-smoothing 


One significant problem that can arise with some graph neural networks is called 
over-smoothing in which the node-embedding vectors tend to become very similar to 
each other after a number of iterations of message-passing, which effectively limits 
the depth of the network. One way to help alleviate this issue is to introduce residual 
connections. For example, we can modify the update operator (13.34): 


\mathbf{h}_n^{(l+1)} = \mathrm{Update}_{\text{node}}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l+1)}, \mathbf{g}^{(l)}\right) + \mathbf{h}_n^{(l)}. \qquad (13.36)


Figure 13.5 Illustration of the general graph message-passing updates defined by (13.32) to (13.35), showing
(a) edge updates, (b) node updates, and (c) global graph updates. In each case the variable being updated is
shown in red and the variables that contribute to that update are those shown in red and blue.


Another approach for mitigating the effects of over-smoothing is to allow the
output layer to take information from all previous layers of the network and not just 
the final convolutional layer. This can be done for example by concatenating the 
representations from previous layers: 


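\mathbf{y}_n = \mathbf{f}\left(\mathbf{h}_n^{(1)} \oplus \mathbf{h}_n^{(2)} \oplus \cdots \oplus \mathbf{h}_n^{(L)}\right) \qquad (13.37)


where a ⊕ b denotes the concatenation of vectors a and b. A variant of this would be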
to combine the vectors using max pooling instead of concatenation. In this case each 
element of the output vector is given by the max of all the corresponding elements 
of the embedding vectors from the previous layers. 


13.3.5 Regularization 


Standard techniques for regularization can be used with graph neural networks, 
including the addition of penalty terms, such as the sum-of-squares of the parameter 
values, to the loss function. In addition, some regularization methods have been 
developed specifically for graph neural networks. 

Graph neural networks already employ weight sharing to achieve permutation 
equivariance and invariance, but typically they have independent parameters in each 
layer. However, weights and biases can also be shared across layers to reduce the 
number of independent parameters. 

Dropout in the context of graph neural networks involves omitting random sub- 
sets of the graph nodes during training, with a fresh random subset chosen for each 
forward pass. This can likewise be applied to the edges in the graph in which ran- 
domly selected subsets of entries in the adjacency matrix are removed, or masked, 
during training. 
Algorithm 13.2: Graph neural network with node, edge, and graph embeddings


Input: Undirected graph G = (V, E)
       Initial node embeddings {h_n^{(0)}}
       Initial edge embeddings {e_{nm}^{(0)}}
       Initial graph embedding g^{(0)}
Output: Final node embeddings {h_n^{(L)}}
        Final edge embeddings {e_{nm}^{(L)}}
        Final graph embedding g^{(L)}


// Iterative message-passing
for l ∈ {0, ..., L − 1} do
    e_{nm}^{(l+1)} ← Update_edge(e_{nm}^{(l)}, h_n^{(l)}, h_m^{(l)}, g^{(l)})
    z_n^{(l+1)} ← Aggregate_node({e_{nm}^{(l+1)} : m ∈ N(n)})
    h_n^{(l+1)} ← Update_node(h_n^{(l)}, z_n^{(l+1)}, g^{(l)})
    g^{(l+1)} ← Update_graph(g^{(l)}, {h_n^{(l+1)}}, {e_{nm}^{(l+1)}})
end for
return {h_n^{(L)}}, {e_{nm}^{(L)}}, g^{(L)}


13.3.6 Geometric deep learning 


We have seen how permutation symmetry is a key consideration when design- 
ing deep learning models for graph-structured data. It acts as a form of inductive 
bias, dramatically reducing the data requirements while improving predictive perfor- 
mance. In applications of graph neural networks associated with spatial properties, 
such as graphics meshes, fluid flow simulations, or molecular structures, there are 
additional equivariance and invariance properties that can be built into the network 
architecture. 

Consider the task of predicting the properties of a molecule, for example when 
exploring the space of candidate drugs. The molecule can be represented as a list 
of atoms of given types (carbon, hydrogen, nitrogen, etc.) along with the spatial 
coordinates of each atom expressed as a three-dimensional column vector. We can 
introduce an associated embedding vector for each atom n at each layer l, denoted
by r_n^{(l)}, and these vectors can be initialized with the known atom coordinates. However,
the values for the elements of these vectors depend on the arbitrary choice of
coordinate system, whereas the properties of the molecule do not. For example, the
solubility of the molecule is unchanged if it is rotated in space or translated to a new
position relative to the origin of the coordinate system, or if the coordinate system
itself is reflected to give the mirror image version of the molecule. The molecular
properties should therefore be invariant under such transformations.

By making careful choices of the functional forms for the update and aggregation
operations (Satorras, Hoogeboom, and Welling, 2021), the new embeddings
r_n^{(l)} can be incorporated into the graph neural network update equations (13.29) to
(13.31) to achieve the required symmetry properties:


\mathbf{e}_{nm}^{(l+1)} = \mathrm{Update}_{\text{edge}}\left(\mathbf{e}_{nm}^{(l)}, \mathbf{h}_n^{(l)}, \mathbf{h}_m^{(l)}, \|\mathbf{r}_n^{(l)} - \mathbf{r}_m^{(l)}\|^2\right) \qquad (13.38)

\mathbf{r}_n^{(l+1)} = \mathbf{r}_n^{(l)} + C \sum_{(n,m) \in \mathcal{E}} \left(\mathbf{r}_n^{(l)} - \mathbf{r}_m^{(l)}\right) \phi\left(\mathbf{e}_{nm}^{(l+1)}\right) \qquad (13.39)

\mathbf{z}_n^{(l+1)} = \mathrm{Aggregate}_{\text{node}}\left(\{\mathbf{e}_{nm}^{(l+1)} : m \in \mathcal{N}(n)\}\right) \qquad (13.40)

\mathbf{h}_n^{(l+1)} = \mathrm{Update}_{\text{node}}\left(\mathbf{h}_n^{(l)}, \mathbf{z}_n^{(l+1)}\right) \qquad (13.41)


Note that the quantity ||r_n^{(l)} − r_m^{(l)}||^2 represents the squared distance between the
coordinates r_n^{(l)} and r_m^{(l)}, and this does not depend on translations, rotations, or
reflections. Also, the coordinates r_n^{(l)} are updated through a linear combination of the
relative differences (r_n^{(l)} − r_m^{(l)}). Here φ(e_{nm}^{(l+1)}) is a general scalar function of the
edge embeddings and is represented by a neural network, and the coefficient C is
typically set equal to the reciprocal of the number of terms in the sum. It follows
that under such transformations, the messages in (13.38), (13.40), and (13.41) are
invariant and the coordinate embeddings given by (13.39) are equivariant.

We have seen many examples of symmetries in structured data, from transla- 
tions of objects within images and the permutation of node orderings on graphs, to 
rotations and translations of molecules in three-dimensional space. Capturing these 
symmetries in the structure of a deep neural network is a powerful form of inductive 
bias and forms the basis of a rich field of research known as geometric deep learning 
(Bronstein et al., 2017; Bronstein et al., 2021). 
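
The symmetry properties of the coordinate update (13.39) can be verified numerically. The sketch below assumes that the updated edge embeddings are already available and invariant, replaces the scalar network φ with a fixed stand-in tanh(e · w), and sets C to the reciprocal of the number of edges attached to each node; the function names and the random inputs are illustrative.

```python
import numpy as np

def coordinate_update(r, e_new, edges, w):
    """Coordinate update (13.39) with phi(e) = tanh(e . w) as a stand-in
    scalar network and C = 1 / (number of edges leaving each node)."""
    n, m = edges[:, 0], edges[:, 1]
    phi = np.tanh(e_new @ w)                             # one scalar per edge
    delta = np.zeros_like(r)
    np.add.at(delta, n, (r[n] - r[m]) * phi[:, None])    # sum over edges of n
    counts = np.bincount(n, minlength=r.shape[0])
    return r + delta / np.maximum(counts, 1)[:, None]

def squared_distances(r, edges):
    """Invariant edge feature ||r_n - r_m||^2 used in (13.38)."""
    diff = r[edges[:, 0]] - r[edges[:, 1]]
    return np.sum(diff ** 2, axis=1)

rng = np.random.default_rng(6)
edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [0, 2], [2, 0]])
r = rng.normal(size=(3, 3))                  # three atoms in 3D
e_new = rng.normal(size=(6, 4))              # stand-in updated edge embeddings
w = rng.normal(size=4)

Q, _ = np.linalg.qr(rng.normal(size=(3, 3))) # random orthogonal matrix
c = rng.normal(size=3)                       # random translation
r_moved = r @ Q.T + c                        # rotate/reflect, then translate

out = coordinate_update(r, e_new, edges, w)
out_moved = coordinate_update(r_moved, e_new, edges, w)
```

Transforming the inputs and then updating gives the same coordinates as updating and then transforming, that is, `out_moved` equals `out @ Q.T + c`, while `squared_distances` returns identical values for `r` and `r_moved`.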


Exercises


13.1 (*) Show that the permutation (A, B, C, D, E) → (C, E, A, D, B) corresponding to
the two choices of node ordering in Figure 13.2 can be expressed in the form (13.5)
with a permutation matrix given by (13.1).


13.2 (**) Show that the number of edges connected to each node of a graph is given
by the corresponding diagonal element of the matrix A^2 where A is the adjacency
matrix.


13.3 (*) Draw the graph whose adjacency matrix is given by


[adjacency matrix lost in extraction] (13.42)


13.4 (**) Show that the effect of pre-multiplying a data matrix X using a permutation
matrix P defined by (13.3) is to create a new data matrix \tilde{X} given by (13.4) whose
rows are permuted according to the permutation function π(·).


13.5 (**) Show that the transformed adjacency matrix \tilde{A} defined by (13.5), where P is
defined by (13.3), is such that both the rows and the columns are permuted according
to the permutation function π(·) relative to the original adjacency matrix A.


13.6 (**) In this exercise we write the update equations (13.16) as graph-level equations
using matrices. To keep the notation uncluttered, we omit the layer index l. First,
gather the node-embedding vectors {h_n} into an N × D matrix H in which row n
is given by h_n^T. Then show that the neighbourhood-aggregated vectors z_n given by


\mathbf{z}_n = \sum_{m \in \mathcal{N}(n)} \mathbf{h}_m \qquad (13.43)


can be written in matrix form as Z = AH where Z is the N × D matrix in which
row n is given by z_n^T, and A is the adjacency matrix. Finally, show that the argument
to the nonlinear activation function in (13.16) can be written in matrix form as


\mathbf{A}\mathbf{H}\mathbf{W}_{\text{neigh}} + \mathbf{H}\mathbf{W}_{\text{self}} + \mathbf{1}_D \mathbf{b}^{\mathrm{T}} \qquad (13.44)

where 1_D is the D-dimensional column vector in which all elements are 1.


13.7 (**) By making use of the equivariance property (13.19) for layer l of a deep graph
convolutional network along with the permutation property (13.4) for the node variables,
show that a complete deep graph convolutional network defined by (13.18) is
also equivariant.


13.8 (**) Explain why the aggregation function defined by (13.24), in which the attention
weights are given by (13.28), is equivariant under a reordering of the nodes in the
graph.


13.9 (*) Show that a graph attention network in which the graph is fully connected, so that
there is an edge between every pair of nodes, is equivalent to a standard transformer
architecture.


13.10 (**) When a coordinate system is translated, the location of an object defined by
that coordinate system is transformed using


\tilde{\mathbf{r}} = \mathbf{r} + \mathbf{c} \qquad (13.45)


where c is a fixed vector describing the translation. Similarly, if the coordinate system
is rotated and/or mirror reflected, the location vector of an object is transformed
using

\tilde{\mathbf{r}} = \mathbf{R}\mathbf{r} \qquad (13.46)


where R is an orthogonal matrix whose inverse is given by its transpose so that


\mathbf{R}\mathbf{R}^{\mathrm{T}} = \mathbf{R}^{\mathrm{T}}\mathbf{R} = \mathbf{I}. \qquad (13.47)

Using these properties, show that under translations, rotations, and reflections, the
messages in (13.38), (13.40), and (13.41) are invariant, and that the coordinate
embeddings given by (13.39) are equivariant.




14 Sampling


There are many situations in deep learning where we need to create synthetic exam- 
ples of a variable z from a probability distribution p(z). Here z might be a scalar 
and the distribution might be a univariate Gaussian, or z might be a high-resolution 
image and p(z) might be a generative model defined by a deep neural network. The 
process of creating such examples is known as sampling, also known as Monte Carlo 
sampling. For many simple distributions there are numerical techniques that gener- 
ate suitable samples directly, whereas for more complex distributions, including ones 
that are defined implicitly, we may need more sophisticated approaches. We adopt 
the convention of referring to each instantiated value as a sample, in contrast to the 
convention used in classical statistics whereby ‘sample’ refers to a set of values. 

In this chapter we focus on aspects of sampling that are most relevant to deep 
learning. Further information on Monte Carlo methods more generally can be found 
in Gilks, Richardson, and Spiegelhalter (1996) and Robert and Casella (1999). 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. M. Bishop, H. Bishop, Deep Learning, https://doi.org/10.1007/978-3-031-45468-4_14


14.1. Basic Sampling Algorithms


Exercise 14.1


In this section, we explore a variety of relatively simple strategies for generating 
random samples from a given distribution. Because the samples will be generated 
by a computer algorithm, they will in fact be pseudo-random, that is, they will be 
calculated using a deterministic algorithm but must nevertheless pass appropriate 
tests for randomness. Here we will assume that an algorithm has been provided 
that generates pseudo-random numbers distributed uniformly over (0, 1), and indeed 
most software environments have such a facility built in. 


14.1.1 Expectations 


Although for some applications the samples themselves may be of direct inter- 
est, in other situations the goal is to evaluate expectations with respect to the distri- 
bution. Suppose we wish to find the expectation of a function f(z) with respect to a 
probability distribution p(z). Here, the components of z might comprise discrete or 
continuous variables or some combination of the two. For continuous variables the 
expectation is defined by 


\mathbb{E}[f] = \int f(\mathbf{z})\, p(\mathbf{z})\, \mathrm{d}\mathbf{z} \qquad (14.1)


where the integral is replaced by summation for discrete variables. This is illus- 
trated schematically for a single continuous variable in Figure 14.1. We will suppose 
that such expectations are too complex to be evaluated exactly using analytical tech- 
niques. 

Figure 14.1 Schematic illustration of a function f(z) whose expectation is to be evaluated
with respect to a distribution p(z).

The general idea behind sampling methods is to obtain a set of samples z^{(l)}
(where l = 1, ..., L) drawn independently from the distribution p(z). This allows
the expectation (14.1) to be approximated by a finite sum:


\bar{f} = \frac{1}{L} \sum_{l=1}^{L} f\left(\mathbf{z}^{(l)}\right). \qquad (14.2)


If the samples z^{(l)} are drawn from the distribution p(z), then E[\bar{f}] = E[f(z)] and so
the estimator \bar{f} has the correct mean. We can also write this in the form


\mathbb{E}[f(\mathbf{z})] \simeq \frac{1}{L} \sum_{l=1}^{L} f\left(\mathbf{z}^{(l)}\right) \qquad (14.3)


where the symbol ≃ denotes that the right-hand side is an unbiased estimator of the
left-hand side, that is, the two sides are equal when averaged over the noise distribution.

The variance of the estimator (14.2) is given by


\mathrm{var}[\bar{f}] = \frac{1}{L} \mathbb{E}\left[(f - \mathbb{E}[f])^2\right] \qquad (14.4)


where E[(f − E[f])^2] is the variance of the function f(z) under the distribution p(z). Note that the
decrease of this variance in proportion to 1/L does not depend on the dimensionality
of z, and that, in principle, high accuracy may be achievable with a relatively
small number of samples {z^{(l)}}. The problem, however, is that the samples {z^{(l)}}
might not be independent, and so the effective sample size might be much smaller
than the apparent sample size. Also, referring back to Figure 14.1, note that if f(z) 
is small in regions where p(z) is large and vice versa, then the expectation may be 
dominated by regions of small probability, implying that relatively large sample sizes 
will be required to achieve sufficient accuracy. 
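
The estimator (14.2) and the 1/L behaviour of its variance (14.4) can be illustrated with a small experiment. The sketch assumes a case where the answer is known: z is a standard Gaussian and f(z) = z^2, so that E[f] = 1 and the variance of f is 2. The helper names and the sample sizes are arbitrary choices.

```python
import numpy as np

def mc_expectation(f, sampler, L, rng):
    """Estimate E[f] by averaging f over L independent samples."""
    return np.mean(f(sampler(L, rng)))

rng = np.random.default_rng(4)
f = lambda z: z ** 2                               # E[f] = 1 under N(0, 1)
sampler = lambda L, rng: rng.standard_normal(L)    # independent draws from p(z)

estimate = mc_expectation(f, sampler, 100_000, rng)

def estimator_variance(L, repeats=2000):
    """Empirical variance of the estimator over many independent runs."""
    runs = [mc_expectation(f, sampler, L, rng) for _ in range(repeats)]
    return np.var(runs)

# With var[f] = 2, (14.4) predicts roughly 2 / 50 and 2 / 500 respectively.
v_small, v_large = estimator_variance(50), estimator_variance(500)
```

Increasing L by a factor of ten reduces the empirical variance by roughly the same factor, in line with (14.4); the samples here are independent by construction, which is exactly the assumption that can fail for the more elaborate samplers discussed later.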


14.1.2 Standard distributions 


We now consider how to generate random numbers from simple nonuniform dis- 
tributions, assuming that we already have available a source of uniformly distributed 
random numbers. Suppose that z is uniformly distributed over the interval (0, 1), 
and that we transform the values of z using some function g(-) so that y = g(z). The 
distribution of y will be governed by 


p(y) = p(z) \left| \frac{\mathrm{d}z}{\mathrm{d}y} \right| \qquad (14.5)


where, in this case, p(z) = 1. Our goal is to choose the function g(z) such that the
resulting values of y have some specific desired distribution p(y). Integrating (14.5)
we obtain


z = \int_{-\infty}^{y} p(\hat{y})\, \mathrm{d}\hat{y} \equiv h(y) \qquad (14.6)


which is the indefinite integral of p(y). Thus, y = h^{-1}(z), and so we have to
transform the uniformly distributed random numbers using a function that is the
inverse of the indefinite integral of the desired distribution. This is illustrated in
Figure 14.2.
Consider for example the exponential distribution 


p(y) = \lambda \exp(-\lambda y) \qquad (14.7)


where 0 ⩽ y < ∞. In this case the lower limit of the integral in (14.6) is 0, and so
h(y) = 1 − exp(−λy). Thus, if we transform our uniformly distributed variable z
using y = −λ^{-1} ln(1 − z), then y will have an exponential distribution.
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Figure 14.2 Geometrical interpretation of the 
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transformation method for generating 
non-uniformly distributed random 
numbers. h(y) is the indefinite integral 
of the desired distribution p(y). If a 
uniformly distributed random variable 
z is transformed using y = h~‘(z), 
then y will be distributed according to 


p(y). 


Another example of a distribution to which the transformation method can be 
applied is given by the Cauchy distribution 


p(y) = \frac{1}{\pi} \frac{1}{1 + y^2}. \qquad (14.8)
In this case, the inverse of the indefinite integral can be expressed in terms of the tan 
function. 
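
Both examples of the transformation method can be written in a few lines. For the Cauchy distribution the sketch uses h(y) = 1/2 + arctan(y)/π, whose inverse is y = tan(π(z − 1/2)); this explicit form is filled in here and is not spelled out in the text. The function names are illustrative.

```python
import numpy as np

def sample_exponential(lam, size, rng):
    """Exponential samples via y = -ln(1 - z) / lam with z uniform on [0, 1)."""
    z = rng.uniform(size=size)
    return -np.log1p(-z) / lam

def sample_cauchy(size, rng):
    """Cauchy samples via y = tan(pi (z - 1/2)), the inverse of
    h(y) = 1/2 + arctan(y) / pi."""
    z = rng.uniform(size=size)
    return np.tan(np.pi * (z - 0.5))

rng = np.random.default_rng(7)
y_exp = sample_exponential(2.0, 10_000, rng)
y_cauchy = sample_cauchy(10_000, rng)
```

The exponential samples have mean close to 1/λ, whereas the Cauchy distribution has no finite mean, so a sanity check for `y_cauchy` is better based on quantiles such as the median (0) and the upper quartile (1).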
The generalization to multiple variables involves the Jacobian of the change of
variables, so that


p(y_1, \ldots, y_M) = p(z_1, \ldots, z_M) \left| \frac{\partial(z_1, \ldots, z_M)}{\partial(y_1, \ldots, y_M)} \right|. \qquad (14.9)


Figure 14.3 The Box–Muller method for generating Gaussian-distributed random numbers
starts by generating samples from a uniform distribution inside the unit circle.

As a final example of the transformation technique, we consider the Box–Muller
method for generating samples from a Gaussian distribution. First, suppose we generate
pairs of uniformly distributed random numbers z_1, z_2 ∈ (−1, 1), which we can
do by transforming a variable distributed uniformly over (0, 1) using z → 2z − 1.
Next we discard each pair unless it satisfies z_1^2 + z_2^2 ⩽ 1. This leads to a uniform
distribution of points inside the unit circle with p(z_1, z_2) = 1/π, as illustrated in
Figure 14.3. Then, for each pair z_1, z_2 we evaluate the quantities


y_1 = z_1 \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2} \qquad (14.10)

y_2 = z_2 \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2} \qquad (14.11)


where r^2 = z_1^2 + z_2^2. Then the joint distribution of y_1 and y_2 is given by


p(y_1, y_2) = p(z_1, z_2) \left| \frac{\partial(z_1, z_2)}{\partial(y_1, y_2)} \right| = \left[ \frac{1}{\sqrt{2\pi}} \exp\left(-y_1^2/2\right) \right] \left[ \frac{1}{\sqrt{2\pi}} \exp\left(-y_2^2/2\right) \right] \qquad (14.12)


and so y_1 and y_2 are independent and each has a Gaussian distribution with zero
mean and unit variance. 
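
The procedure just described translates directly into vectorized code. The sketch below follows (14.10) and (14.11); the function name is illustrative, and the extra guard against r^2 = 0 is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np

def box_muller(num_pairs, rng):
    """Return an array of shape (k, 2), k <= num_pairs, of independent
    standard-normal values; pairs outside the unit circle are discarded."""
    z = rng.uniform(-1.0, 1.0, size=(num_pairs, 2))
    r2 = np.sum(z ** 2, axis=1)
    keep = (r2 > 0.0) & (r2 <= 1.0)          # r2 = 0 would give log(0)
    z, r2 = z[keep], r2[keep]
    scale = np.sqrt(-2.0 * np.log(r2) / r2)
    return z * scale[:, None]                # (14.10) and (14.11)

samples = box_muller(10_000, np.random.default_rng(8))
```

About π/4 of the candidate pairs fall inside the unit circle and are retained, so the function returns slightly fewer than `num_pairs` rows; each retained pair contributes two independent Gaussian values.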

If y has a Gaussian distribution with zero mean and unit variance, then σy + μ
will have a Gaussian distribution with mean μ and variance σ^2. To generate vector-valued
variables having a multivariate Gaussian distribution with mean μ and covariance
Σ, we can make use of the Cholesky decomposition, which takes the form
Σ = L L^T (Deisenroth, Faisal, and Ong, 2020). Then, if z is a random vector whose
components are independent and Gaussian distributed with zero mean and unit variance,
then y = μ + Lz will be Gaussian with mean μ and covariance Σ.
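
A short sketch of the Cholesky construction is given below. It relies on `numpy.linalg.cholesky`, which returns the lower-triangular factor; the function name `sample_gaussian` and the example mean and covariance are assumptions made for illustration.

```python
import numpy as np

def sample_gaussian(mu, Sigma, size, rng):
    """Draw `size` samples y = mu + L z, where Sigma = L L^T and the
    components of z are independent standard Gaussians."""
    L = np.linalg.cholesky(Sigma)            # lower-triangular factor
    z = rng.standard_normal((size, len(mu)))
    return mu + z @ L.T                      # each row is one sample y

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
y = sample_gaussian(mu, Sigma, 10_000, np.random.default_rng(9))
```

The sample mean and sample covariance of `y` approach `mu` and `Sigma` as the number of samples grows. The decomposition requires Σ to be symmetric and positive definite, which holds for any valid non-degenerate covariance matrix.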

Clearly, the transformation technique depends for its success on the ability to 
calculate and then invert the indefinite integral of the required distribution. Such 
operations are feasible only for a limited number of simple distributions, and so we 
must turn to alternative approaches in search of a more general strategy. Here we 
consider two techniques called rejection sampling and importance sampling. Al- 
though mainly limited to univariate distributions and thus not directly applicable to 
complex problems in many dimensions, they do form important components in more 
general strategies. 


14.1.3 Rejection sampling 


The rejection sampling framework allows us to sample from relatively complex 
distributions, subject to certain constraints. We begin by considering univariate dis- 
tributions and subsequently discuss the extension to multiple dimensions. 

Suppose we wish to sample from a distribution p(z) that is not one of the simple,
standard distributions considered so far and that sampling directly from p(z) is
difficult. Furthermore suppose, as is often the case, that we are easily able to evaluate
p(z) for any given value of z, up to some normalizing constant $Z_p$, so that

$$
p(z) = \frac{1}{Z_p} \tilde{p}(z) \tag{14.13}
$$

where $\tilde{p}(z)$ can readily be evaluated, but $Z_p$ is unknown.

To apply rejection sampling, we need some simpler distribution q(z), sometimes
called a proposal distribution, from which we can readily draw samples. We next
introduce a constant k whose value is chosen such that $kq(z) \geqslant \tilde{p}(z)$ for all
values of z. The function kq(z) is called the comparison function and is illustrated
for a univariate distribution in Figure 14.4. Each step of the rejection sampler
involves generating two random numbers. First, we generate a number $z_0$ from the
distribution q(z). Next, we generate a number $u_0$ from the uniform distribution over
$[0, kq(z_0)]$. This pair of random numbers has uniform distribution under the curve
of the function kq(z). Finally, if $u_0 > \tilde{p}(z_0)$ then the sample is rejected, otherwise
$u_0$ is retained. Thus, the pair is rejected if it lies in the grey shaded region in
Figure 14.4. The remaining pairs then have uniform distribution under the curve of
$\tilde{p}(z)$, and hence the corresponding z values are distributed according to p(z), as
desired.

Figure 14.4 In the rejection sampling method, samples are drawn from a simple
distribution q(z) and rejected if they fall in the grey area between the
unnormalized distribution $\tilde{p}(z)$ and the scaled distribution kq(z). The resulting
samples are distributed according to p(z), which is the normalized version of
$\tilde{p}(z)$.

Exercise 14.7

Exercise 14.8

The original values of z are generated from the distribution q(z), and these
samples are then accepted with probability $\tilde{p}(z)/kq(z)$, and so the probability that a
sample will be accepted is given by

$$
\begin{aligned}
p(\text{accept}) &= \int \left\{ \tilde{p}(z)/kq(z) \right\} q(z) \, \mathrm{d}z \\
&= \frac{1}{k} \int \tilde{p}(z) \, \mathrm{d}z.
\end{aligned} \tag{14.14}
$$

Thus, the fraction of points that are rejected by this method depends on the ratio of
the area under the unnormalized distribution $\tilde{p}(z)$ to the area under the curve kq(z).
We therefore see that the constant k should be as small as possible subject to the
limitation that kq(z) must be nowhere less than $\tilde{p}(z)$.
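The rejection sampling procedure and the acceptance probability (14.14) can be illustrated
with a short Python sketch. The target $\tilde{p}(z) = \exp(-z^2/2)(1 + \sin^2 3z)$, the
standard Gaussian proposal, and all function names are assumptions chosen only so that a
valid constant k is easy to write down. For this target
$Z_p = \sqrt{2\pi}\,(1.5 - e^{-18}/2)$, so (14.14) predicts an acceptance probability close
to 0.75.

```python
import math
import random


def p_tilde(z):
    """Unnormalized target (arbitrary example): exp(-z^2/2) (1 + sin^2(3z))."""
    return math.exp(-0.5 * z * z) * (1.0 + math.sin(3.0 * z) ** 2)


def q_pdf(z):
    """Proposal density q(z): standard Gaussian."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Since 1 + sin^2(3z) <= 2, choosing k = 2 sqrt(2 pi) ensures k q(z) >= p_tilde(z).
K = 2.0 * math.sqrt(2.0 * math.pi)


def rejection_sample(n_proposals, rng):
    """Run n_proposals rejection steps and return the accepted z values."""
    accepted = []
    for _ in range(n_proposals):
        z0 = rng.gauss(0.0, 1.0)               # z0 ~ q(z)
        u0 = rng.uniform(0.0, K * q_pdf(z0))   # u0 ~ U[0, k q(z0)]
        if u0 <= p_tilde(z0):                  # rejected when u0 > p_tilde(z0)
            accepted.append(z0)
    return accepted


rng = random.Random(0)
n_proposals = 50_000
samples = rejection_sample(n_proposals, rng)
acceptance_rate = len(samples) / n_proposals

# Prediction from (14.14): (1/k) * integral of p_tilde = Z_p / k.
predicted_rate = (1.5 - 0.5 * math.exp(-18.0)) / 2.0
sample_mean = sum(samples) / len(samples)  # target is symmetric, so near 0
print(acceptance_rate, predicted_rate, sample_mean)
```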

As an illustration of the use of rejection sampling, consider the task of sampling
from the gamma distribution

$$
\mathrm{Gam}(z|a, b) = \frac{b^a z^{a-1} \exp(-bz)}{\Gamma(a)} \tag{14.15}
$$

which, for a > 1, has a bell-shaped form, as shown in Figure 14.5. A suitable
proposal distribution is therefore the Cauchy (14.8) because this too is bell-shaped
and because we can use the transformation method, discussed earlier, to sample from
it. We need to generalize the Cauchy slightly to ensure that it nowhere has a smaller
value than the gamma distribution. This can be achieved by transforming a uniform
random variable y using $z = b \tan y + c$, which gives random numbers distributed
according to

$$
q(z) = \frac{k}{1 + (z - c)^2 / b^2}. \tag{14.16}
$$


Figure 14.5 Plot showing the gamma distribution given by (14.15) as the green
curve, with a scaled Cauchy proposal distribution shown by the red curve.
Samples from the gamma distribution can be obtained by sampling from the
Cauchy and then applying the rejection sampling criterion.


The minimum rejection rate is obtained by setting $c = a - 1$ and $b^2 = 2a - 1$ and
choosing the constant k to be as small as possible while still satisfying the requirement
$kq(z) \geqslant \tilde{p}(z)$. The resulting comparison function is also illustrated in Figure 14.5.
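A hedged Python sketch of this gamma sampler is given below. It works with the unit-rate
unnormalized density $\tilde{p}(z) = z^{a-1} e^{-z}$ and rescales the accepted sample at the
end. The Cauchy scale is named `scale` in the code to avoid a clash with the gamma
parameter b of (14.15), and the comparison function is taken to touch $\tilde{p}(z)$ at its
mode $z = c$, which one can check (by differentiating the log of the ratio) is the smallest
valid choice when $b^2 = 2a - 1$. The function name and test parameters are illustrative.

```python
import math
import random


def sample_gamma(a, rate, rng):
    """Draw one sample from Gam(z|a, rate), a > 1, by rejection from a Cauchy."""
    c = a - 1.0                        # Cauchy location: mode of z^(a-1) exp(-z)
    scale = math.sqrt(2.0 * a - 1.0)   # Cauchy scale, from b^2 = 2a - 1
    while True:
        # Tangent of an angle distributed uniformly over (-pi/2, pi/2).
        t = math.tan(math.pi * (rng.random() - 0.5))
        z = scale * t + c
        if z <= 0.0:
            continue                   # the gamma density vanishes for z <= 0
        # Ratio p_tilde(z) / (k q(z)), with k q(z) = p_tilde(c) / (1 + t^2).
        ratio = (1.0 + t * t) * math.exp(c * math.log(z / c) - (z - c))
        if rng.random() <= ratio:
            return z / rate            # rescale the unit-rate sample


rng = random.Random(0)
a, rate = 3.0, 2.0                     # example parameters (arbitrary)
samples = [sample_gamma(a, rate, rng) for _ in range(50_000)]

# Gam(z|a, b) has mean a/b and variance a/b^2.
n = len(samples)
mean = sum(samples) / n
var = sum((z - mean) ** 2 for z in samples) / n
print(mean, var)
```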


14.1.4 Adaptive rejection sampling 


In many instances where we might wish to apply rejection sampling, it can be 
difficult to determine a suitable analytic form for the envelope distribution q(z). An 
alternative approach is to construct the envelope function on the fly based on
measured values of the distribution p(z) (Gilks and Wild, 1992). Constructing an
envelope function is particularly straightforward when p(z) is log concave, in other words
when $\ln p(z)$ has derivatives that are non-increasing functions of z. The construction
of a suitable envelope function is illustrated graphically in Figure 14.6.

The function $\ln p(z)$ and its gradient are evaluated at some initial set of grid
points, and the intersections of the resulting tangent lines are used to construct the 
envelope function. Next a sample value is drawn from the envelope distribution. 
This is straightforward because the log of the envelope distribution is a succession 
of linear functions, and hence the envelope distribution itself comprises a piecewise 
exponential distribution of the form 


$$
q(z) = k_i \lambda_i \exp\left\{ -\lambda_i (z - z_{i-1}) \right\}, \qquad z_{i-1} < z \leqslant z_i. \tag{14.17}
$$


Once a sample has been drawn, the usual rejection criterion can be applied. If the 
sample is accepted, then it will be a draw from the desired distribution. If, however, 
the sample is rejected, then it is incorporated into the set of grid points, a new tangent 
line is computed, and the envelope function is thereby refined. As the number of 
grid points increases, so the envelope function becomes a better approximation of 
the desired distribution p(z) and the probability of rejection decreases. 

A variant of the algorithm exists that avoids the evaluation of derivatives
(Gilks, 1992). The adaptive rejection sampling framework can also be extended to
distributions that are not log concave, simply by following each rejection sampling
step with a Metropolis-Hastings step (to be discussed in Section 14.2.3), giving rise
to adaptive rejection Metropolis sampling (Gilks, Best, and Tan, 1995).

Figure 14.6 In rejection sampling, if a distribution is log concave then an
envelope function can be constructed using the tangent lines computed at a set
of grid points. If a sample point is rejected, it is added to the set of grid
points and used to refine the envelope distribution.

Figure 14.7 Illustrative example used to highlight a limitation of rejection
sampling. Samples are drawn from a Gaussian distribution p(z) shown by the
green curve, by using rejection sampling from a proposal distribution q(z) that
is also Gaussian and whose scaled version kq(z) is shown by the red curve.

For rejection sampling to be of practical value, we require that the comparison
function is close to the required distribution so that the rate of rejection is kept to a
minimum. Now let us examine what happens when we try to use rejection sampling
in spaces of high dimensionality. Consider, for illustration, a somewhat artificial
problem in which we wish to sample from a zero-mean multivariate Gaussian
distribution with covariance $\sigma_p^2 \mathbf{I}$, where $\mathbf{I}$ is the unit matrix, by rejection
sampling from a proposal distribution that is itself a zero-mean Gaussian distribution
having covariance $\sigma_q^2 \mathbf{I}$. Clearly, we must have $\sigma_q^2 \geqslant \sigma_p^2$ to ensure that there exists
a k such that $kq(z) \geqslant p(z)$. In D dimensions, the optimum value of k is given by
$k = (\sigma_q/\sigma_p)^D$ as illustrated for D = 1 in Figure 14.7. The acceptance rate will be the
ratio of volumes under p(z) and kq(z), which, because both distributions are
normalized, is just 1/k. Thus, the acceptance rate diminishes exponentially with
dimensionality. Even if $\sigma_q$ exceeds $\sigma_p$ by just 1%, for D = 1,000 the acceptance ratio
will be approximately 1/20,000. In this illustrative example, the comparison function
is close to the required distribution. For more practical examples, where the desired
distribution may be multimodal and sharply peaked, it will be extremely difficult to find
a good proposal distribution and comparison function. Furthermore, the exponential
decrease of the acceptance rate with dimensionality is a generic feature of rejection
sampling. Although rejection can be a useful technique in one or two dimensions, it
is unsuited to problems of high dimensionality. It can, however, play a role as a
subroutine in more sophisticated algorithms for sampling in high-dimensional spaces.
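The figure of roughly 1/20,000 quoted above follows directly from $1/k = (\sigma_p/\sigma_q)^D$
with $\sigma_q/\sigma_p = 1.01$ and D = 1,000. A short Python check (the function name is an
illustrative choice) makes the exponential decay explicit:

```python
def acceptance_rate(sigma_q, sigma_p, dim):
    """Acceptance rate 1/k for the Gaussian example, with k = (sigma_q/sigma_p)^D."""
    return (sigma_p / sigma_q) ** dim


# Proposal standard deviation only 1% larger than that of the target.
for dim in (1, 10, 100, 1000):
    rate = acceptance_rate(1.01, 1.0, dim)
    print(f"D = {dim:4d}: acceptance rate = {rate:.3e} (about 1 in {1.0 / rate:,.0f})")
```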


14.1.5 Importance sampling 


One reason for wishing to sample from complicated probability distributions is 
to evaluate expectations of the form (14.1). The technique of importance sampling 
provides a framework for approximating expectations directly but does not itself 
provide a mechanism for drawing samples from a distribution p(z). 

The finite sum approximation to the expectation, given by (14.2), depends on 
being able to draw samples from the distribution p(z). Suppose, however, that it is 
impractical to sample directly from p(z) but that we can evaluate p(z) easily for any 
given value of z. One simplistic strategy for evaluating expectations would be to 
discretize z-space into a uniform grid and to evaluate the integrand as a sum of the 
form 


$$
\mathbb{E}[f] \simeq \sum_{l=1}^{L} p(z^{(l)}) f(z^{(l)}). \tag{14.18}
$$


An obvious problem with this approach is that the number of terms in the summation 
grows exponentially with the dimensionality of z. Furthermore, as we have already 
noted, the kinds of probability distributions of interest will often have much of their 
mass confined to relatively small regions of z-space and so uniform sampling will be 
very inefficient because in high-dimensional problems, only a very small proportion 
of the samples will make a significant contribution to the sum. We would really 
like to choose sample points from regions where p(z) is large or ideally where the 
product p(z) f(z) is large. 

As with rejection sampling, importance sampling is based on the use of a proposal
distribution q(z) from which it is easy to draw samples, as illustrated in Figure 14.8.
We can then express the expectation in the form of a finite sum over samples
$\{z^{(l)}\}$ drawn from q(z):

$$
\begin{aligned}
\mathbb{E}[f] &= \int f(z) p(z) \, \mathrm{d}z \\
&= \int f(z) \frac{p(z)}{q(z)} q(z) \, \mathrm{d}z \\
&\simeq \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})} f(z^{(l)}).
\end{aligned} \tag{14.19}
$$

The quantities $r_l = p(z^{(l)})/q(z^{(l)})$ are known as importance weights, and they
correct the bias introduced by sampling from the wrong distribution. Note that, unlike
rejection sampling, all the samples generated are retained.
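The estimator (14.19) translates almost line for line into code. In the sketch below the
target is a standard Gaussian, the proposal is a broader Gaussian with standard deviation
2, and $f(z) = z^2$, so that the exact expectation is 1. These choices, and the function
names, are assumptions made purely for illustration.

```python
import math
import random


def gaussian_pdf(z, mean, std):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(f, p, q_pdf, q_sample, n_samples, rng):
    """Estimate E_p[f] as (1/L) sum_l r_l f(z_l) with r_l = p(z_l) / q(z_l)."""
    total = 0.0
    for _ in range(n_samples):
        z = q_sample(rng)        # z ~ q(z)
        r = p(z) / q_pdf(z)      # importance weight
        total += r * f(z)
    return total / n_samples


rng = random.Random(0)
estimate = importance_estimate(
    f=lambda z: z * z,
    p=lambda z: gaussian_pdf(z, 0.0, 1.0),       # target N(0, 1)
    q_pdf=lambda z: gaussian_pdf(z, 0.0, 2.0),   # proposal N(0, 2^2)
    q_sample=lambda r: r.gauss(0.0, 2.0),
    n_samples=50_000,
    rng=rng,
)
print(estimate)  # should be close to E[z^2] = 1
```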

Figure 14.8 Importance sampling addresses the problem of evaluating the
expectation of a function f(z) with respect to a distribution p(z) from which
it is difficult to draw samples directly. Instead, samples $\{z^{(l)}\}$ are drawn from
a simpler distribution q(z), and the corresponding terms in the summation are
weighted by the ratios $p(z^{(l)})/q(z^{(l)})$.

Often the distribution p(z) can be evaluated only up to a normalization constant,
so that $p(z) = \tilde{p}(z)/Z_p$, where $\tilde{p}(z)$ can be evaluated easily, whereas $Z_p$ is unknown.
Similarly, we may wish to use an importance sampling distribution $q(z) = \tilde{q}(z)/Z_q$,
which has the same property. We then have

$$
\begin{aligned}
\mathbb{E}[f] &= \int f(z) p(z) \, \mathrm{d}z \\
&= \frac{Z_q}{Z_p} \int f(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) \, \mathrm{d}z \\
&\simeq \frac{Z_q}{Z_p} \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l f(z^{(l)})
\end{aligned} \tag{14.20}
$$

where $\tilde{r}_l = \tilde{p}(z^{(l)})/\tilde{q}(z^{(l)})$. We can use the same sample set to evaluate the ratio
$Z_p/Z_q$ with the result

$$
\begin{aligned}
\frac{Z_p}{Z_q} &= \frac{1}{Z_q} \int \tilde{p}(z) \, \mathrm{d}z = \int \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) \, \mathrm{d}z \\
&\simeq \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l
\end{aligned} \tag{14.21}
$$

and hence the expectation in (14.20) is given by a weighted sum:

$$
\mathbb{E}[f] \simeq \sum_{l=1}^{L} w_l f(z^{(l)}) \tag{14.22}
$$

where we have defined

$$
w_l = \frac{\tilde{r}_l}{\sum_m \tilde{r}_m} = \frac{\tilde{p}(z^{(l)})/q(z^{(l)})}{\sum_m \tilde{p}(z^{(m)})/q(z^{(m)})}. \tag{14.23}
$$

Note that $\{w_l\}$ are non-negative numbers that sum to one.
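A minimal Python sketch of the self-normalized estimator (14.22), together with the
estimate (14.21) of $Z_p/Z_q$, is shown below. The unnormalized target and proposal are
both Gaussians chosen so that the exact answers are known: $Z_p/Z_q = 1/2$ and
$\mathbb{E}[z] = 1$. All names and parameter values are illustrative assumptions.

```python
import math
import random


def p_tilde(z):
    """Unnormalized target: Gaussian, mean 1, unit variance (Z_p = sqrt(2 pi))."""
    return math.exp(-0.5 * (z - 1.0) ** 2)


def q_tilde(z):
    """Unnormalized proposal: Gaussian, mean 0, std 2 (Z_q = 2 sqrt(2 pi))."""
    return math.exp(-z * z / 8.0)


rng = random.Random(0)
L = 50_000
zs = [rng.gauss(0.0, 2.0) for _ in range(L)]        # z_l ~ q(z)
r_tilde = [p_tilde(z) / q_tilde(z) for z in zs]      # unnormalized ratios

z_ratio = sum(r_tilde) / L                           # (14.21): estimate of Z_p / Z_q
total = sum(r_tilde)
w = [r / total for r in r_tilde]                     # (14.23): normalized weights
mean_estimate = sum(wl * z for wl, z in zip(w, zs))  # (14.22) with f(z) = z
print(z_ratio, mean_estimate)
```

Because the weights are normalized by their own sum, neither $Z_p$ nor $Z_q$ is needed to
form the estimate of the expectation.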

As with rejection sampling, the success of importance sampling depends crucially
on how well the sampling distribution q(z) matches the desired distribution
p(z). If, as is often the case, p(z)f(z) is strongly varying and has a significant
proportion of its mass concentrated over relatively small regions of z-space, then the
set of importance weights $\{r_l\}$ may be dominated by a few weights having large
values, with the remaining weights being relatively insignificant. Thus, the effective
sample size can be much smaller than the apparent sample size L. The problem is
even more severe if none of the samples falls in the regions where p(z)f(z) is large.
In that case, the apparent variances of $r_l$ and $r_l f(z^{(l)})$ may be small even though
the estimate of the expectation may be severely wrong. Hence, a major drawback of
importance sampling is its potential to produce results that are arbitrarily in error and
with no diagnostic indication. This also highlights a key requirement for the
sampling distribution q(z), namely that it should not be small or zero in regions where
p(z) may be significant.


14.1.6 Sampling-importance-resampling 


The rejection sampling method discussed in Section 14.1.3 depends in part for 
its success on the determination of a suitable value for the constant k. For many 
pairs of distributions p(z) and q(z), it will be impractical to determine a suitable 
value for k as any value that is sufficiently large to guarantee a bound on the desired 
distribution will lead to impractically small acceptance rates. 

As with rejection sampling, the sampling-importance-resampling approach also
makes use of a sampling distribution q(z) but avoids having to determine the constant
k. There are two stages to the scheme. In the first stage, L samples $z^{(1)}, \ldots, z^{(L)}$ are
drawn from q(z). Then in the second stage, weights $w_1, \ldots, w_L$ are constructed
using (14.23). Finally, a second set of L samples is drawn from the discrete distribution
$(z^{(1)}, \ldots, z^{(L)})$ with probabilities given by the weights $(w_1, \ldots, w_L)$.
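The two stages and the final resampling step can be sketched as follows. The target here
is an unnormalized Gaussian with mean 1 and unit variance and the proposal is a zero-mean
Gaussian with standard deviation 2. Both are illustrative assumptions, as is the function
name `sir`. The resampling uses `random.Random.choices`, which draws with replacement
according to the supplied weights.

```python
import math
import random


def p_tilde(z):
    """Unnormalized target: Gaussian with mean 1 and unit variance."""
    return math.exp(-0.5 * (z - 1.0) ** 2)


def q_pdf(z):
    """Proposal density: Gaussian with mean 0 and standard deviation 2."""
    return math.exp(-z * z / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))


def sir(n_samples, rng):
    """Sampling-importance-resampling with L = n_samples."""
    # Stage 1: draw L samples from q(z).
    zs = [rng.gauss(0.0, 2.0) for _ in range(n_samples)]
    # Stage 2: construct the normalized weights (14.23).
    ratios = [p_tilde(z) / q_pdf(z) for z in zs]
    total = sum(ratios)
    weights = [r / total for r in ratios]
    # Resample L values from the discrete distribution over the z_l.
    return rng.choices(zs, weights=weights, k=n_samples)


rng = random.Random(0)
resampled = sir(50_000, rng)

n = len(resampled)
mean = sum(resampled) / n
var = sum((z - mean) ** 2 for z in resampled) / n
print(mean, var)  # should be close to 1 and 1 for this target
```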

The resulting L samples are only approximately distributed according to p(z),
but the distribution becomes correct in the limit $L \to \infty$. To see this, consider the
univariate case, and note that the cumulative distribution of the resampled values is
given by

$$
\begin{aligned}
p(z \leqslant a) &= \sum_{l : z^{(l)} \leqslant a} w_l \\
&= \frac{\sum_l I(z^{(l)} \leqslant a) \, \tilde{p}(z^{(l)})/q(z^{(l)})}{\sum_l \tilde{p}(z^{(l)})/q(z^{(l)})}
\end{aligned} \tag{14.24}
$$

where $I(\cdot)$ is the indicator function (which equals 1 if its argument is true and 0
otherwise). Taking the limit $L \to \infty$ and assuming suitable regularity of the
distributions, we can replace the sums by integrals weighted according to the original
sampling distribution q(z):

$$
\begin{aligned}
p(z \leqslant a) &= \frac{\int I(z \leqslant a) \left\{ \tilde{p}(z)/q(z) \right\} q(z) \, \mathrm{d}z}{\int \left\{ \tilde{p}(z)/q(z) \right\} q(z) \, \mathrm{d}z} \\
&= \frac{\int I(z \leqslant a) \, \tilde{p}(z) \, \mathrm{d}z}{\int \tilde{p}(z) \, \mathrm{d}z} \\
&= \int I(z \leqslant a) \, p(z) \, \mathrm{d}z
\end{aligned} \tag{14.25}
$$

which is the cumulative distribution function of p(z). Again, we see that normalization
of p(z) is not required.

For a finite value of L and a given initial sample set, the resampled values will
only approximately be drawn from the desired distribution. As with rejection
sampling, the approximation improves as the sampling distribution q(z) gets closer to
the desired distribution p(z). When q(z) = p(z), the initial samples $(z^{(1)}, \ldots, z^{(L)})$
have the desired distribution and the weights $w_l = 1/L$, so that the resampled values
also have the desired distribution.

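If moments with respect to the distribution p(z) are required, then they can be
evaluated directly using the original samples together with the weights, because

$$
\begin{aligned}
\mathbb{E}[f(z)] &= \int f(z) p(z) \, \mathrm{d}z \\
&= \frac{\int f(z) \left[ \tilde{p}(z)/q(z) \right] q(z) \, \mathrm{d}z}{\int \left[ \tilde{p}(z)/q(z) \right] q(z) \, \mathrm{d}z} \\
&\simeq \sum_{l=1}^{L} w_l f(z^{(l)}).
\end{aligned} \tag{14.26}
$$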


Markov Chain Monte Carlo 


In the previous section, we discussed the rejection sampling and importance sam- 
pling strategies for evaluating expectations of functions, and we saw that they suffer 
from severe limitations particularly in spaces of high dimensionality. We therefore 
turn in this section to a very general and powerful framework called Markov chain 
Monte Carlo, which allows sampling from a large class of distributions and which 
scales well with the dimensionality of the sample space. Markov chain Monte Carlo 
methods have their origins in physics (Metropolis and Ulam, 1949), and it was only
towards the end of the 1980s that they started to have a significant impact in the field
of statistics.

As with rejection and importance sampling, we again sample from a proposal
distribution. This time, however, we maintain a record of the current state $\mathbf{z}^{(\tau)}$,
and the proposal distribution $q(\mathbf{z}|\mathbf{z}^{(\tau)})$ is conditioned on this current state, and so
the sequence of samples $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \ldots$ forms a Markov chain. Again, if we write
$p(\mathbf{z}) = \tilde{p}(\mathbf{z})/Z_p$, we will assume that $\tilde{p}(\mathbf{z})$ can readily be evaluated for any given
value of $\mathbf{z}$, although the value of $Z_p$ may be unknown. The proposal distribution
is chosen to be sufficiently simple that it is straightforward to draw samples from it
directly. At each cycle of the algorithm, we generate a candidate sample $\mathbf{z}^\star$ from
the proposal distribution and then accept the sample according to an appropriate
criterion.


14.2.1 The Metropolis algorithm 


In the basic Metropolis algorithm (Metropolis et al., 1953), we assume that the
proposal distribution is symmetric, that is $q(\mathbf{z}_A|\mathbf{z}_B) = q(\mathbf{z}_B|\mathbf{z}_A)$ for all values of
$\mathbf{z}_A$ and $\mathbf{z}_B$. The candidate sample is then accepted with probability

$$
A(\mathbf{z}^\star, \mathbf{z}^{(\tau)}) = \min\left( 1, \frac{\tilde{p}(\mathbf{z}^\star)}{\tilde{p}(\mathbf{z}^{(\tau)})} \right). \tag{14.27}
$$

This can be achieved by choosing a random number u with uniform distribution over
the unit interval (0, 1) and then accepting the sample if $A(\mathbf{z}^\star, \mathbf{z}^{(\tau)}) > u$. Note that
if the step from $\mathbf{z}^{(\tau)}$ to $\mathbf{z}^\star$ causes an increase in the value of $p(\mathbf{z})$, then the candidate
point is certain to be kept.

If the candidate sample is accepted, then $\mathbf{z}^{(\tau+1)} = \mathbf{z}^\star$, otherwise the candidate
point $\mathbf{z}^\star$ is discarded, $\mathbf{z}^{(\tau+1)}$ is set to $\mathbf{z}^{(\tau)}$, and another candidate sample is drawn
from the distribution $q(\mathbf{z}|\mathbf{z}^{(\tau+1)})$. This is in contrast to rejection sampling, where
rejected samples are simply discarded. In the Metropolis algorithm, when a candidate
point is rejected, the previous sample is included in the final list of samples, leading
to multiple copies of samples. Of course, in a practical implementation, only a single
copy of each retained sample would be kept, along with an integer weighting factor
recording how many times that state appears. As we will see, if $q(\mathbf{z}_A|\mathbf{z}_B)$ is positive
for any values of $\mathbf{z}_A$ and $\mathbf{z}_B$ (this is a sufficient but not necessary condition), the
distribution of $\mathbf{z}^{(\tau)}$ tends to $p(\mathbf{z})$ as $\tau \to \infty$. It should be emphasized, however, that
the sequence $\mathbf{z}^{(1)}, \mathbf{z}^{(2)}, \ldots$ is not a set of independent samples from $p(\mathbf{z})$ because
successive samples are highly correlated. If we wish to obtain independent samples,
then we can discard most of the sequence and just retain every Mth sample. For
M sufficiently large, the retained samples will for all practical purposes be
independent. The Metropolis algorithm is summarized in Algorithm 14.1. Figure 14.9
shows a simple illustrative example of sampling from a two-dimensional Gaussian
distribution using the Metropolis algorithm in which the proposal distribution is an
isotropic Gaussian.
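The Metropolis update described above can be sketched in Python as follows. This is an
illustrative implementation under stated assumptions rather than a definitive one: it
works with $\ln \tilde{p}(\mathbf{z})$ for numerical stability, returns the whole chain
(so that rejected proposals appear as repeated states), and uses a two-dimensional
Gaussian target with correlation 0.5 and a proposal standard deviation of 1, all of which
are arbitrary choices for the example.

```python
import math
import random


def metropolis(log_p_tilde, z0, step, n_iter, rng):
    """Metropolis sampling with an isotropic Gaussian proposal of std `step`.

    Returns the chain z^(1), ..., z^(T) and the fraction of accepted proposals.
    """
    z_prev = list(z0)
    log_p_prev = log_p_tilde(z_prev)
    chain, n_accept = [], 0
    for _ in range(n_iter):
        # Symmetric proposal centred on the current state.
        z_star = [zi + rng.gauss(0.0, step) for zi in z_prev]
        log_p_star = log_p_tilde(z_star)
        # A = min(1, p_tilde(z*) / p_tilde(z_prev)), evaluated in log space.
        accept_prob = math.exp(min(0.0, log_p_star - log_p_prev))
        if accept_prob > rng.random():
            z_prev, log_p_prev = z_star, log_p_star
            n_accept += 1
        chain.append(list(z_prev))   # a rejection repeats the previous state
    return chain, n_accept / n_iter


RHO = 0.5  # correlation of the example target (arbitrary)


def log_p_tilde(z):
    """Log of an unnormalized zero-mean 2-D Gaussian with unit variances."""
    z1, z2 = z
    return -0.5 * (z1 * z1 - 2.0 * RHO * z1 * z2 + z2 * z2) / (1.0 - RHO * RHO)


rng = random.Random(0)
chain, accept_rate = metropolis(log_p_tilde, [0.0, 0.0], step=1.0,
                                n_iter=100_000, rng=rng)
n = len(chain)
mean = [sum(z[i] for z in chain) / n for i in range(2)]
cov12 = sum((z[0] - mean[0]) * (z[1] - mean[1]) for z in chain) / n
print(accept_rate, mean, cov12)
```

Retaining every Mth element of `chain`, for example `chain[::50]`, gives the thinned
sequence mentioned in the text.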

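Algorithm 14.1: Metropolis sampling

Input: Unnormalized distribution $\tilde{p}(\mathbf{z})$
       Proposal distribution $q(\mathbf{z}|\hat{\mathbf{z}})$
       Initial state $\mathbf{z}^{(0)}$
       Number of iterations T
Output: $\mathbf{z} \sim \tilde{p}(\mathbf{z})$

$\mathbf{z}_{\mathrm{prev}} \leftarrow \mathbf{z}^{(0)}$
// Iterative sampling
for $\tau \in \{1, \ldots, T\}$ do
    $\mathbf{z}^\star \sim q(\mathbf{z}|\mathbf{z}_{\mathrm{prev}})$  // Sample from proposal distribution
    $u \sim \mathcal{U}(0, 1)$  // Sample from uniform
    if $\tilde{p}(\mathbf{z}^\star) / \tilde{p}(\mathbf{z}_{\mathrm{prev}}) > u$ then
        $\mathbf{z}_{\mathrm{prev}} \leftarrow \mathbf{z}^\star$  // $\mathbf{z}^{(\tau)} = \mathbf{z}^\star$
    else
        $\mathbf{z}_{\mathrm{prev}} \leftarrow \mathbf{z}_{\mathrm{prev}}$  // $\mathbf{z}^{(\tau)} = \mathbf{z}^{(\tau-1)}$
    end if
end for
return $\mathbf{z}_{\mathrm{prev}}$  // $\mathbf{z}^{(T)}$

Further insight into the nature of Markov chain Monte Carlo algorithms can be
gleaned by looking at the properties of a specific example, namely a simple random
walk. Consider a state space z consisting of the integers, with probabilities

$$
p(z^{(\tau+1)} = z^{(\tau)}) = 0.5 \tag{14.28}
$$

$$
p(z^{(\tau+1)} = z^{(\tau)} + 1) = 0.25 \tag{14.29}
$$

$$
p(z^{(\tau+1)} = z^{(\tau)} - 1) = 0.25 \tag{14.30}
$$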


where $z^{(\tau)}$ denotes the state at step $\tau$. If the initial state is $z^{(0)} = 0$, then by symmetry
the expected state at time $\tau$ will also be zero, $\mathbb{E}[z^{(\tau)}] = 0$, and similarly it is easily
seen that $\mathbb{E}[(z^{(\tau)})^2] = \tau/2$. Thus, after $\tau$ steps, the random walk has travelled
only a distance that on average is proportional to the square root of $\tau$. This square
root dependence is typical of random walk behaviour and shows that random walks
are very inefficient in exploring the state space. As we will see, a central goal in
designing Markov chain Monte Carlo methods is to avoid random walk behaviour.
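The result $\mathbb{E}[(z^{(\tau)})^2] = \tau/2$ follows because each step adds an independent
increment with mean zero and second moment $0.25 \cdot 1 + 0.25 \cdot 1 = 0.5$, so the second
moments accumulate linearly. The following Python sketch checks this by propagating the
exact distribution over states rather than by simulation. The function name is an
illustrative choice.

```python
def random_walk_moments(n_steps):
    """Exact E[z] and E[z^2] after n_steps of the walk (14.28)-(14.30), from z = 0."""
    dist = {0: 1.0}   # probability of each integer state
    for _ in range(n_steps):
        new_dist = {}
        for z, prob in dist.items():
            # Stay with probability 0.5, step right or left with probability 0.25.
            for dz, p_move in ((0, 0.5), (1, 0.25), (-1, 0.25)):
                new_dist[z + dz] = new_dist.get(z + dz, 0.0) + prob * p_move
        dist = new_dist
    mean = sum(z * prob for z, prob in dist.items())
    second_moment = sum(z * z * prob for z, prob in dist.items())
    return mean, second_moment


for tau in (4, 16, 100):
    mean, second = random_walk_moments(tau)
    print(f"tau = {tau:3d}: E[z] = {mean:.3f}, E[z^2] = {second:.3f}, tau/2 = {tau / 2}")
```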


14.2.2 Markov chains 


Figure 14.9 A simple illustration using the Metropolis algorithm to sample from
a Gaussian distribution whose one standard-deviation contour is shown by the
ellipse. The proposal distribution is an isotropic Gaussian distribution whose
standard deviation is 0.2. Steps that are accepted are shown as green lines,
and rejected steps are shown in red. A total of 150 candidate samples are
generated, of which 43 are rejected.

Before discussing Markov chain Monte Carlo methods in more detail, it is useful
to study some general properties of Markov chains. In particular, we ask under what
circumstances will a Markov chain converge to the desired distribution. A first-order
Markov chain is defined to be a series of random variables $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(M)}$ such that
the following conditional independence property holds for $m \in \{1, \ldots, M - 1\}$:

$$
p(\mathbf{z}^{(m+1)}|\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(m)}) = p(\mathbf{z}^{(m+1)}|\mathbf{z}^{(m)}), \tag{14.31}
$$

which can be represented as a directed graphical model in the form of a chain
(see Figure 11.29). We can then specify the Markov chain by giving the probability
distribution for the initial variable $p(\mathbf{z}^{(0)})$ together with the conditional distributions
for subsequent variables in the form of transition probabilities
$T_m(\mathbf{z}^{(m)}, \mathbf{z}^{(m+1)}) \equiv p(\mathbf{z}^{(m+1)}|\mathbf{z}^{(m)})$. A Markov chain is called homogeneous if the
transition probabilities are the same for all m.

The marginal probability for a particular variable can be expressed in terms of
the marginal probability for the previous variable in the chain:

$$
p(\mathbf{z}^{(m+1)}) = \int p(\mathbf{z}^{(m+1)}|\mathbf{z}^{(m)}) \, p(\mathbf{z}^{(m)}) \, \mathrm{d}\mathbf{z}^{(m)} \tag{14.32}
$$

where the integral is replaced by a summation for discrete variables. A distribution
is said to be invariant, or stationary, with respect to a Markov chain if each step in
the chain leaves that distribution invariant. Thus, for a homogeneous Markov chain
with transition probabilities $T(\mathbf{z}', \mathbf{z})$, the distribution $p^\star(\mathbf{z})$ is invariant if

$$
p^\star(\mathbf{z}) = \int T(\mathbf{z}', \mathbf{z}) \, p^\star(\mathbf{z}') \, \mathrm{d}\mathbf{z}'. \tag{14.33}
$$


Note that a given Markov chain may have more than one invariant distribution. For
instance, if the transition probabilities are given by the identity transformation, then
any distribution will be invariant.

A sufficient (but not necessary) condition for ensuring that the required
distribution $p(\mathbf{z})$ is invariant is to choose the transition probabilities to satisfy the property
of detailed balance, defined by

$$
p^\star(\mathbf{z}) \, T(\mathbf{z}, \mathbf{z}') = p^\star(\mathbf{z}') \, T(\mathbf{z}', \mathbf{z}) \tag{14.34}
$$

for the particular distribution $p^\star(\mathbf{z})$. It is easily seen that a transition probability
that satisfies detailed balance with respect to a particular distribution will leave that
distribution invariant, because

$$
\int p^\star(\mathbf{z}') \, T(\mathbf{z}', \mathbf{z}) \, \mathrm{d}\mathbf{z}' = \int p^\star(\mathbf{z}) \, T(\mathbf{z}, \mathbf{z}') \, \mathrm{d}\mathbf{z}' \tag{14.35}
$$

$$
= p^\star(\mathbf{z}) \int p(\mathbf{z}'|\mathbf{z}) \, \mathrm{d}\mathbf{z}' \tag{14.36}
$$

$$
= p^\star(\mathbf{z}). \tag{14.37}
$$

A Markov chain that respects detailed balance is said to be reversible.

Our goal is to use Markov chains to sample from a given distribution. We can
achieve this if we set up a Markov chain such that the desired distribution is invariant.
However, we must also require that for $m \to \infty$, the distribution $p(\mathbf{z}^{(m)})$ converges
to the required invariant distribution $p^\star(\mathbf{z})$, irrespective of the choice of initial
distribution $p(\mathbf{z}^{(0)})$. This property is called ergodicity, and the invariant distribution
is then called the equilibrium distribution. Clearly, an ergodic Markov chain can
have only one equilibrium distribution. It can be shown that a homogeneous Markov
chain will be ergodic, subject only to weak restrictions on the invariant distribution
and the transition probabilities (Neal, 1993).
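For a small discrete state space, detailed balance (14.34), invariance (14.33), and
convergence to the equilibrium distribution can all be checked numerically. The sketch
below builds the transition matrix of a Metropolis chain on five states with a symmetric
nearest-neighbour proposal. The target distribution, the proposal, and the function names
are assumptions chosen for illustration, and the matrix entry `T[i][j]` corresponds to
$T(\mathbf{z}, \mathbf{z}')$ with current state i and next state j.

```python
def metropolis_transition_matrix(p):
    """Transition matrix for a Metropolis chain on states 0..n-1.

    The symmetric proposal moves one step left or right with probability 1/2
    each; proposals that would leave the range are treated as rejections.
    """
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                T[i][j] = 0.5 * min(1.0, p[j] / p[i])
        T[i][i] = 1.0 - sum(T[i])   # probability of remaining in state i
    return T


def step(dist, T):
    """One application of (14.32): new[j] = sum_i dist[i] T[i][j]."""
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


p_star = [0.1, 0.2, 0.4, 0.2, 0.1]   # example target distribution (arbitrary)
T = metropolis_transition_matrix(p_star)
n = len(p_star)

# Detailed balance (14.34): p*(z) T(z, z') = p*(z') T(z', z) for every pair.
max_db_violation = max(abs(p_star[i] * T[i][j] - p_star[j] * T[j][i])
                       for i in range(n) for j in range(n))

# Invariance (14.33): a single step leaves p* unchanged.
max_inv_violation = max(abs(a - b) for a, b in zip(step(p_star, T), p_star))

# Convergence: start from a point mass on state 0 and iterate the chain.
dist = [1.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(500):
    dist = step(dist, T)
max_conv_error = max(abs(a - b) for a, b in zip(dist, p_star))
print(max_db_violation, max_inv_violation, max_conv_error)
```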

In practice we often construct the transition probabilities from a set of 'base'
transitions $B_1, \ldots, B_K$. This can be achieved through a mixture distribution of the
form

$$
T(\mathbf{z}', \mathbf{z}) = \sum_{k=1}^{K} \alpha_k B_k(\mathbf{z}', \mathbf{z}) \tag{14.38}
$$

for some set of mixing coefficients $\alpha_1, \ldots, \alpha_K$ satisfying $\alpha_k \geqslant 0$ and $\sum_k \alpha_k = 1$.
Alternatively, the base transitions may be combined through successive application,
so that

$$
T(\mathbf{z}', \mathbf{z}) = \sum_{\mathbf{z}_1} \cdots \sum_{\mathbf{z}_{K-1}} B_1(\mathbf{z}', \mathbf{z}_1) \cdots B_{K-1}(\mathbf{z}_{K-2}, \mathbf{z}_{K-1}) B_K(\mathbf{z}_{K-1}, \mathbf{z}). \tag{14.39}
$$

If a distribution is invariant with respect to each of the base transitions, then clearly it
will also be invariant with respect to either of the $T(\mathbf{z}', \mathbf{z})$ given by (14.38) or (14.39).
For the mixture (14.38), if each of the base transitions satisfies detailed balance, then
the mixture transition T will also satisfy detailed balance. This does not hold for the
transition probability constructed using (14.39), although by symmetrizing the order
of application of the base transitions, namely $B_1, B_2, \ldots, B_K, B_K, \ldots, B_2, B_1$,
detailed balance can be restored. A common example of the use of composite transition
probabilities is where each base transition changes only a subset of the variables.


14.2.3 The Metropolis-Hastings algorithm


Earlier we introduced the basic Metropolis algorithm without actually
demonstrating that it samples from the required distribution. Before giving a proof, we first
discuss a generalization, known as the Metropolis-Hastings algorithm (Hastings,
1970), which applies when the proposal distribution is no longer a symmetric
function of its arguments. In particular at step $\tau$ of the algorithm, in which the current
state is $\mathbf{z}^{(\tau)}$, we draw a sample $\mathbf{z}^\star$ from the distribution $q_k(\mathbf{z}|\mathbf{z}^{(\tau)})$ and then accept it
with probability $A_k(\mathbf{z}^\star, \mathbf{z}^{(\tau)})$ where

$$
A_k(\mathbf{z}^\star, \mathbf{z}^{(\tau)}) = \min\left( 1, \frac{\tilde{p}(\mathbf{z}^\star) \, q_k(\mathbf{z}^{(\tau)}|\mathbf{z}^\star)}{\tilde{p}(\mathbf{z}^{(\tau)}) \, q_k(\mathbf{z}^\star|\mathbf{z}^{(\tau)})} \right). \tag{14.40}
$$

Here k labels the members of the set of possible transitions being considered. Again,
evaluating the acceptance criterion does not require knowledge of the normalizing
constant $Z_p$ in the probability distribution $p(\mathbf{z}) = \tilde{p}(\mathbf{z})/Z_p$. For a symmetric
proposal distribution, the Metropolis-Hastings criterion (14.40) reduces to the standard
Metropolis criterion given by (14.27). Metropolis-Hastings sampling is summarized
in Algorithm 14.2.

We can show that p(z) is an invariant distribution of the Markov chain defined 
by the Metropolis—Hastings algorithm by showing that detailed balance, defined by 
(14.34), is satisfied. Using (14.40) we have 


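$$
\begin{aligned}
p(\mathbf{z}) \, q_k(\mathbf{z}'|\mathbf{z}) \, A_k(\mathbf{z}', \mathbf{z}) &= \min\left( p(\mathbf{z}) \, q_k(\mathbf{z}'|\mathbf{z}), \; p(\mathbf{z}') \, q_k(\mathbf{z}|\mathbf{z}') \right) \\
&= \min\left( p(\mathbf{z}') \, q_k(\mathbf{z}|\mathbf{z}'), \; p(\mathbf{z}) \, q_k(\mathbf{z}'|\mathbf{z}) \right) \\
&= p(\mathbf{z}') \, q_k(\mathbf{z}|\mathbf{z}') \, A_k(\mathbf{z}, \mathbf{z}')
\end{aligned} \tag{14.41}
$$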


as required. 
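A Python sketch of Metropolis-Hastings with a genuinely asymmetric proposal is given
below, under stated assumptions. The target is an unnormalized gamma density
$\tilde{p}(z) = z^2 e^{-z}$ on $z > 0$, and the proposal is a log-normal random walk
$z^\star = z \exp(\sigma \epsilon)$ with $\epsilon \sim \mathcal{N}(0, 1)$, for which
$q(z|z^\star)/q(z^\star|z) = z^\star/z$. A single proposal distribution is used, so the
index k is dropped. None of these choices or names comes from the text.

```python
import math
import random


def metropolis_hastings(log_p_tilde, sample_q, log_q, z0, n_iter, rng):
    """Metropolis-Hastings with a single, possibly asymmetric, proposal.

    sample_q(z_from, rng) draws z* ~ q(.|z_from); log_q(z_to, z_from) returns
    ln q(z_to|z_from) up to a constant common to both directions.
    """
    z_prev, chain = z0, []
    for _ in range(n_iter):
        z_star = sample_q(z_prev, rng)
        # Log of the ratio inside the min of (14.40).
        log_ratio = (log_p_tilde(z_star) + log_q(z_prev, z_star)
                     - log_p_tilde(z_prev) - log_q(z_star, z_prev))
        if math.exp(min(0.0, log_ratio)) > rng.random():
            z_prev = z_star
        chain.append(z_prev)
    return chain


SIGMA = 0.8  # proposal width on the log scale (arbitrary)


def log_p_tilde(z):
    """Log of the unnormalized gamma target z^2 exp(-z), defined for z > 0."""
    return 2.0 * math.log(z) - z


def sample_q(z_from, rng):
    """Log-normal random walk: z* = z exp(SIGMA * eps), eps ~ N(0, 1)."""
    return z_from * math.exp(SIGMA * rng.gauss(0.0, 1.0))


def log_q(z_to, z_from):
    """ln q(z_to|z_from) for the log-normal proposal, dropping constants."""
    return -math.log(z_to) - 0.5 * ((math.log(z_to) - math.log(z_from)) / SIGMA) ** 2


rng = random.Random(0)
chain = metropolis_hastings(log_p_tilde, sample_q, log_q, 1.0, 100_000, rng)
chain_mean = sum(chain) / len(chain)
print(chain_mean)  # the gamma distribution with a = 3, b = 1 has mean 3
```

Omitting the `log_q` terms would reduce the acceptance rule to the Metropolis criterion
(14.27), which is not valid here because this proposal is not symmetric.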

Algorithm 14.2: Metropolis-Hastings sampling

Input: Unnormalized distribution $\tilde{p}(\mathbf{z})$
       Proposal distributions $\{q_k(\mathbf{z}|\hat{\mathbf{z}}) : k \in 1, \ldots, K\}$
       Mapping from iteration index to distribution index $M(\cdot)$
       Initial state $\mathbf{z}^{(0)}$
       Number of iterations T
Output: $\mathbf{z} \sim \tilde{p}(\mathbf{z})$

$\mathbf{z}_{\mathrm{prev}} \leftarrow \mathbf{z}^{(0)}$
// Iterative sampling
for $\tau \in \{1, \ldots, T\}$ do
    $k \leftarrow M(\tau)$  // get distribution index for this iteration
    $\mathbf{z}^\star \sim q_k(\mathbf{z}|\mathbf{z}_{\mathrm{prev}})$  // sample from proposal distribution
    $u \sim \mathcal{U}(0, 1)$  // sample from uniform
    if $\tilde{p}(\mathbf{z}^\star) \, q_k(\mathbf{z}_{\mathrm{prev}}|\mathbf{z}^\star) / \tilde{p}(\mathbf{z}_{\mathrm{prev}}) \, q_k(\mathbf{z}^\star|\mathbf{z}_{\mathrm{prev}}) > u$ then
        $\mathbf{z}_{\mathrm{prev}} \leftarrow \mathbf{z}^\star$  // $\mathbf{z}^{(\tau)} = \mathbf{z}^\star$
    else
        $\mathbf{z}_{\mathrm{prev}} \leftarrow \mathbf{z}_{\mathrm{prev}}$  // $\mathbf{z}^{(\tau)} = \mathbf{z}^{(\tau-1)}$
    end if
end for
return $\mathbf{z}_{\mathrm{prev}}$  // $\mathbf{z}^{(T)}$

The specific choice of proposal distribution can have a marked effect on the
performance of the algorithm. For continuous state spaces, a common choice is a
Gaussian centred on the current state, leading to an important trade-off in determining the
variance parameter of this distribution. If the variance is small, then the proportion of
accepted transitions will be high, but progress through the state space takes the form
of a slow random walk leading to long correlation times. However, if the variance
parameter is large, then the rejection rate will be high because, in the kind of
complex problems we are considering, many of the proposed steps will be to states for
which the probability $p(\mathbf{z})$ is low. Consider a multivariate distribution $p(\mathbf{z})$ having
strong correlations between the components of $\mathbf{z}$, as illustrated in Figure 14.10. The
scale $\rho$ of the proposal distribution should be as large as possible without incurring
high rejection rates. This suggests that $\rho$ should be of the same order as the smallest
length scale $\sigma_{\min}$. The system then explores the distribution along the more extended
direction by means of a random walk, and so the number of steps to arrive at a state
that is more or less independent of the original state is of order $(\sigma_{\max}/\sigma_{\min})^2$. In fact
in two dimensions, the increase in rejection rate as $\rho$ increases is offset by the larger
step sizes of those transitions that are accepted, and more generally for a
multivariate Gaussian, the number of steps required to obtain independent samples scales like
$(\sigma_{\max}/\sigma_2)^2$ where $\sigma_2$ is the second-smallest standard deviation (Neal, 1993). These
details aside, if the length scales over which the distributions vary are very different
in different directions, then the Metropolis-Hastings algorithm can have very slow
convergence.


14.2.4 Gibbs sampling 


Gibbs sampling (Geman and Geman, 1984) is a simple and widely applicable
Markov chain Monte Carlo algorithm and can be seen as a special case of the
Metropolis-Hastings algorithm. Consider the distribution $p(\mathbf{z}) = p(z_1, \ldots, z_M)$
from which we wish to sample, and suppose that we have chosen some initial state
for the Markov chain. Each step of the Gibbs sampling procedure involves replacing
the value of one of the variables by a value drawn from the distribution of that
variable conditioned on the values of the remaining variables. Thus, we replace $z_i$ by
a value drawn from the distribution $p(z_i|\mathbf{z}_{\setminus i})$, where $z_i$ denotes the ith component
of $\mathbf{z}$, and $\mathbf{z}_{\setminus i}$ denotes $\{z_1, \ldots, z_M\}$ but with $z_i$ omitted. This procedure is repeated
either by cycling through the variables in some particular order or by choosing the
variable to be updated at each step at random from some distribution.

Figure 14.10 Schematic illustration of using an isotropic Gaussian proposal
distribution (blue circle) to sample from a correlated multivariate Gaussian
distribution (red ellipse) having very different standard deviations in different
directions, using the Metropolis-Hastings algorithm. To keep the rejection rate
low, the scale $\rho$ of the proposal distribution should be of the order of the
smallest standard deviation $\sigma_{\min}$, which leads to random walk behaviour in which
the number of steps separating states that are approximately independent is of
order $(\sigma_{\max}/\sigma_{\min})^2$ where $\sigma_{\max}$ is the largest standard deviation.
For example, suppose we have a distribution $p(z_1, z_2, z_3)$ over three variables,
and at step $\tau$ of the algorithm we have selected values $z_1^{(\tau)}$, $z_2^{(\tau)}$, and $z_3^{(\tau)}$. We first
replace $z_1^{(\tau)}$ by a new value $z_1^{(\tau+1)}$ obtained by sampling from the conditional
distribution

$$
p(z_1|z_2^{(\tau)}, z_3^{(\tau)}). \tag{14.42}
$$

Next we replace $z_2^{(\tau)}$ by a value $z_2^{(\tau+1)}$ obtained by sampling from the conditional
distribution

$$
p(z_2|z_1^{(\tau+1)}, z_3^{(\tau)}) \tag{14.43}
$$

so that the new value for $z_1$ is used straight away in subsequent sampling steps. Then
we update $z_3$ with a sample $z_3^{(\tau+1)}$ drawn from

$$
p(z_3|z_1^{(\tau+1)}, z_2^{(\tau+1)}) \tag{14.44}
$$

and so on, cycling through the three variables in turn. Gibbs sampling is summarized
in Algorithm 14.3.

To show that this procedure samples from the required distribution, we first note
that the distribution $p(\mathbf{z})$ is an invariant of each of the Gibbs sampling steps
individually and hence of the whole Markov chain. This follows since when we sample from
$p(z_i \mid \mathbf{z}_{\setminus i})$, the marginal distribution $p(\mathbf{z}_{\setminus i})$ is clearly invariant because the value of
$\mathbf{z}_{\setminus i}$ is unchanged. Also, each step by definition samples from the correct conditional
distribution $p(z_i \mid \mathbf{z}_{\setminus i})$. Because these conditional and marginal distributions together
specify the joint distribution, we see that the joint distribution is itself invariant.

The second requirement to be satisfied to ensure that the Gibbs sampling proce- 
dure samples from the correct distribution is that it is ergodic. A sufficient condition 
for ergodicity is that none of the conditional distributions are anywhere zero. If this 
is the case, then any point in z-space can be reached from any other point in a finite 
number of steps involving one update of each of the component variables. If this 
requirement is not satisfied, so that some of the conditional distributions have zeros, 
then ergodicity, if it applies, must be proven explicitly.

Algorithm 14.3: Gibbs sampling

Input: Initial values $\{z_i : i \in 1, \ldots, M\}$
       Conditional distributions $\{p(z_i \mid \{z_{j \neq i}\}) : i \in 1, \ldots, M\}$
       Number of iterations $T$
Output: Final values $\{z_i : i \in 1, \ldots, M\}$

for $\tau \in \{1, \ldots, T\}$ do
    for $i \in \{1, \ldots, M\}$ do
        $z_i \sim p(z_i \mid \{z_{j \neq i}\})$
    end for
end for
return $\{z_i : i \in 1, \ldots, M\}$


The distribution of initial states must also be specified to complete the algorithm, 
although samples drawn after many iterations will effectively become independent 
of this distribution. Of course, successive samples from the Markov chain will be 
highly correlated, and so to obtain samples that are nearly independent it will be 
necessary to sub-sample the sequence. 
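
As a minimal sketch of Algorithm 14.3, consider a zero-mean, unit-variance Gaussian over two variables with correlation `rho`, for which each conditional is itself Gaussian with mean `rho` times the other variable and variance `1 - rho**2`. The target distribution, the function name, and the `burn_in` and `thin` arguments are illustrative assumptions rather than part of the algorithm as stated; discarding early iterations and sub-sampling reflect the remarks above about the initial state and about correlated successive samples.

```python
import random


def gibbs_bivariate_gaussian(rho, num_iterations, burn_in=100, thin=1, seed=0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian.

    Assumed target (for illustration only): correlation rho between z1 and z2,
    so that p(z1 | z2) = N(rho * z2, 1 - rho^2), and likewise for z2.
    """
    rng = random.Random(seed)
    cond_std = (1.0 - rho ** 2) ** 0.5  # standard deviation of each conditional
    z1, z2 = 0.0, 0.0  # arbitrary initial state
    samples = []
    for tau in range(num_iterations):
        # Cycle through the variables; the new z1 is used straight away for z2.
        z1 = rng.gauss(rho * z2, cond_std)
        z2 = rng.gauss(rho * z1, cond_std)
        # Discard early iterations and keep every `thin`-th state thereafter.
        if tau >= burn_in and (tau - burn_in) % thin == 0:
            samples.append((z1, z2))
    return samples


samples = gibbs_bivariate_gaussian(rho=0.5, num_iterations=20100, burn_in=100, thin=2)
mean1 = sum(s[0] for s in samples) / len(samples)        # estimate of E[z1] = 0
corr = sum(s[0] * s[1] for s in samples) / len(samples)  # estimate of E[z1 z2] = rho
```

For strongly correlated variables (`rho` close to 1) the conditional standard deviation becomes small, and many more iterations would be needed for the retained samples to be approximately independent.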

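We can obtain the Gibbs sampling procedure as a particular instance of the
Metropolis-Hastings algorithm as follows. Consider a Metropolis-Hastings sampling
step involving the variable $z_k$ in which the remaining variables $\mathbf{z}_{\setminus k}$ remain
fixed, and for which the transition probability from $\mathbf{z}$ to $\mathbf{z}^\star$ is given by
$q_k(\mathbf{z}^\star \mid \mathbf{z}) = p(z_k^\star \mid \mathbf{z}_{\setminus k})$. Note that $\mathbf{z}^\star_{\setminus k} = \mathbf{z}_{\setminus k}$ because these components are
unchanged by the sampling step. Also, $p(\mathbf{z}) = p(z_k \mid \mathbf{z}_{\setminus k}) p(\mathbf{z}_{\setminus k})$. Thus, the factor that
determines the acceptance probability in the Metropolis-Hastings criterion (14.40) is given by

$$A(\mathbf{z}^\star, \mathbf{z}) = \frac{p(\mathbf{z}^\star) q_k(\mathbf{z} \mid \mathbf{z}^\star)}{p(\mathbf{z}) q_k(\mathbf{z}^\star \mid \mathbf{z})} = \frac{p(z_k^\star \mid \mathbf{z}^\star_{\setminus k}) \, p(\mathbf{z}^\star_{\setminus k}) \, p(z_k \mid \mathbf{z}^\star_{\setminus k})}{p(z_k \mid \mathbf{z}_{\setminus k}) \, p(\mathbf{z}_{\setminus k}) \, p(z_k^\star \mid \mathbf{z}_{\setminus k})} = 1 \qquad (14.45)$$

where we have used $\mathbf{z}^\star_{\setminus k} = \mathbf{z}_{\setminus k}$. Thus, the Metropolis-Hastings steps are always
accepted.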
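
The result (14.45) can be checked numerically on a small example. The sketch below assumes a hypothetical joint distribution over two binary variables (the table `JOINT` is made up for illustration), uses the Gibbs conditional as the proposal, and evaluates the Metropolis-Hastings ratio for every state, every choice of updated variable, and every proposed value.

```python
# Hypothetical joint distribution p(z1, z2) over two binary variables.
JOINT = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}


def conditional(k, value, z):
    """Return p(z_k = value | remaining components of z) from the joint table."""
    def joint_with(v):
        candidate = list(z)
        candidate[k] = v
        return JOINT[tuple(candidate)]
    return joint_with(value) / (joint_with(0) + joint_with(1))


def acceptance_ratio(k, z, z_star):
    """Metropolis-Hastings ratio p(z*) q_k(z | z*) / (p(z) q_k(z* | z))."""
    q_forward = conditional(k, z_star[k], z)   # proposal probability of z* from z
    q_backward = conditional(k, z[k], z_star)  # proposal probability of z from z*
    return (JOINT[z_star] * q_backward) / (JOINT[z] * q_forward)


ratios = []
for z in JOINT:
    for k in (0, 1):
        for value in (0, 1):
            # z* differs from z only (possibly) in component k.
            z_star = tuple(value if i == k else z[i] for i in range(2))
            ratios.append(acceptance_ratio(k, z, z_star))
```

Every entry of `ratios` equals one up to floating-point rounding, which is the statement that Gibbs proposals are never rejected.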

As with the Metropolis algorithm, we can gain some insight into the behaviour of
Gibbs sampling by investigating its application to a Gaussian distribution. Consider
a correlated Gaussian in two variables, as illustrated in Figure 14.11, having
conditional distributions of width $l$ and marginal distributions of width $L$. The typical
step size is governed by the conditional distributions and will be of order $l$. Because
the state evolves according to a random walk, the number of steps needed to obtain
independent samples from the distribution will be of order $(L/l)^2$. Of course if the
Gaussian distribution were uncorrelated, then the Gibbs sampling procedure would
be optimally efficient. For this simple problem, we could rotate the coordinate
system such that the new variables are uncorrelated. However, in practical applications
it will generally be infeasible to find such transformations.

Figure 14.11 Illustration of Gibbs sampling by alternate updates of two variables
whose distribution is a correlated Gaussian. The step size is governed by the
standard deviation of the conditional distribution (green curve), and is $O(l)$,
leading to slow progress in the direction of elongation of the joint distribution
(red ellipse). The number of steps needed to obtain an independent sample from the
distribution is $O((L/l)^2)$.

One approach to reducing the random walk behaviour in Gibbs sampling is
called over-relaxation (Adler, 1981). In its original form, it applies to problems for
which the conditional distributions are Gaussian, which represents a more general
class of distributions than the multivariate Gaussian because, for example, the
non-Gaussian distribution $p(z, y) \propto \exp(-z^2 y^2)$ has Gaussian conditional distributions.
At each step of the Gibbs sampling algorithm, the conditional distribution for a
particular component $z_i$ has some mean $\mu_i$ and some variance $\sigma_i^2$. In the over-relaxation
framework, the value of $z_i$ is replaced with

$$z_i' = \mu_i + \alpha (z_i - \mu_i) + \sigma_i (1 - \alpha^2)^{1/2} \nu \qquad (14.46)$$

where $\nu$ is a Gaussian random variable with zero mean and unit variance, and $\alpha$
is a parameter such that $-1 < \alpha < 1$. For $\alpha = 0$, the method is equivalent to
standard Gibbs sampling, and for $\alpha < 0$ the step is biased to the opposite side of the
mean. This step leaves the desired distribution invariant because if $z_i$ has mean $\mu_i$
and variance $\sigma_i^2$, then so too does $z_i'$. The effect of over-relaxation is to encourage
directed motion through state space when the variables are highly correlated. The
framework of ordered over-relaxation (Neal, 1999) generalizes this approach to
non-Gaussian distributions.
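
The invariance property of the update (14.46) can be illustrated with a short simulation. This is only a sketch: the function name and the particular values of `mu`, `sigma`, and `alpha` are assumptions chosen for illustration. Values of $z_i$ are drawn from a Gaussian with the stated mean and variance, the over-relaxed update is applied once, and the empirical mean and variance of the results are computed.

```python
import random


def over_relaxed_update(z, mu, sigma, alpha, rng):
    """Apply the over-relaxation update (14.46) to a single component z."""
    nu = rng.gauss(0.0, 1.0)  # zero-mean, unit-variance Gaussian noise
    return mu + alpha * (z - mu) + sigma * (1.0 - alpha ** 2) ** 0.5 * nu


rng = random.Random(0)
mu, sigma, alpha = 1.0, 2.0, -0.9  # illustrative values with -1 < alpha < 1

# Start from values with the conditional's mean and variance, then update once.
updated = [
    over_relaxed_update(rng.gauss(mu, sigma), mu, sigma, alpha, rng)
    for _ in range(20000)
]
mean = sum(updated) / len(updated)
var = sum((u - mean) ** 2 for u in updated) / len(updated)
```

The empirical `mean` and `var` should stay close to `mu` and `sigma ** 2`. With `alpha = 0` the update ignores the current value entirely and reduces to an ordinary Gibbs draw from the conditional.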

The practical applicability of Gibbs sampling depends on the ease with which
samples can be drawn from the conditional distributions $p(z_k \mid \mathbf{z}_{\setminus k})$. For probability
distributions specified using directed graphical models, the conditional distributions
for individual nodes depend only on the variables in the corresponding Markov
blanket, as illustrated in Figure 14.12. For directed graphs, a wide choice of conditional
distributions for the individual nodes conditioned on their parents will lead to
conditional distributions for Gibbs sampling that are log concave. The adaptive rejection
sampling methods discussed in Section 14.1.4 therefore provide a framework for
Monte Carlo sampling from directed graphs with broad applicability.

Figure 14.12 The Gibbs sampling method requires samples to be drawn from the
conditional distribution of a variable z conditioned on the remaining variables.
For directed graphical models, this conditional distribution is a function of only
the states of the nodes in the Markov blanket, shaded in blue, which comprises the
parents, the children, and the co-parents.



Because the basic Gibbs sampling technique considers one variable at a time, 
there are strong dependencies between successive samples. At the opposite extreme, 
if we could draw samples directly from the joint distribution (an operation that we 
are supposing is intractable), then successive samples would be independent. We 
can hope to improve on the simple Gibbs sampler by adopting an intermediate strat- 
egy in which we sample successively from groups of variables rather than individual 
variables. This is achieved in the blocking Gibbs sampling algorithm by choosing 
blocks of variables, not necessarily disjoint, and then sampling jointly from the vari- 
ables in each block in turn, conditioned on the remaining variables (Jensen, Kong, 
and Kjaerulff, 1995). 


14.2.5 Ancestral sampling 


For many models, the joint distribution $p(\mathbf{z})$ is conveniently specified in terms
of a graphical model. For a directed graph with no observed variables, it is
straightforward to sample from the joint distribution using the following ancestral sampling
approach. The joint distribution is specified by

$$p(\mathbf{z}) = \prod_{i=1}^{M} p(\mathbf{z}_i \mid \mathrm{pa}(i)) \qquad (14.47)$$

where $\mathbf{z}_i$ are the set of variables associated with node $i$, and $\mathrm{pa}(i)$ denotes the set
of variables associated with the parents of node $i$. To obtain a sample from the joint
distribution, we make one pass through the set of variables in the order $\mathbf{z}_1, \ldots, \mathbf{z}_M$
sampling from the conditional distributions $p(\mathbf{z}_i \mid \mathrm{pa}(i))$. This is always possible
because at each step, all the parent values will have been instantiated. After one pass
through the graph, we will have obtained a sample from the joint distribution. This
assumes that it is possible to sample from the individual conditional distributions at
each node.
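
The single pass described above can be sketched for a small discrete model. The three-node graph, its node names, and its conditional probability tables are hypothetical and chosen only for illustration. The nodes are listed so that every parent appears before its children, which is the ordering that ancestral sampling requires.

```python
import random

# Hypothetical directed graph over binary variables, listed in ancestral order.
# Each entry is (name, parents, table), where the table maps a tuple of parent
# values to the probability that the node takes the value 1.
GRAPH = [
    ("z1", (), {(): 0.3}),
    ("z2", ("z1",), {(0,): 0.2, (1,): 0.9}),
    ("z3", ("z1", "z2"), {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}),
]


def ancestral_sample(graph, rng):
    """Draw one joint sample by visiting the nodes in ancestral order."""
    values = {}
    for name, parents, table in graph:
        # All parents have already been instantiated at this point.
        p_one = table[tuple(values[p] for p in parents)]
        values[name] = 1 if rng.random() < p_one else 0
    return values


rng = random.Random(0)
draws = [ancestral_sample(GRAPH, rng) for _ in range(20000)]

# Exact marginal for comparison: P(z2 = 1) = 0.7 * 0.2 + 0.3 * 0.9 = 0.41.
freq_z2 = sum(d["z2"] for d in draws) / len(draws)
```

Each call makes exactly one pass through the graph, so the cost of a sample grows linearly with the number of nodes.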

Now consider a directed graph in which some of the nodes, which comprise the
evidence set $E$, are instantiated with observed values. We can in principle extend
the above procedure, at least for nodes representing discrete variables, to give the
following logic sampling approach (Henrion, 1988), which can be seen as a special
case of importance sampling (Section 14.1.5). At each step, when a sampled value is
obtained for a variable $\mathbf{z}_i$ whose value is observed, the sampled value is compared to
the observed value, and if they agree then the sample value is retained and the algorithm
proceeds to the next variable in turn. However, if the sampled value and the observed value
disagree, then the whole sample so far is discarded and the algorithm starts again
with the first node in the graph. This algorithm samples correctly from the posterior
distribution because it corresponds simply to drawing samples from the joint
distribution of hidden variables and data variables and then discarding those samples that
disagree with the observed data (with the slight saving of not continuing with the
sampling from the joint distribution as soon as one contradictory value is observed).
However, the overall probability of accepting a sample from the posterior decreases
rapidly as the number of observed variables increases and as the number of states that
those variables can take increases, and so this approach is rarely used in practice.

An improvement on this approach is called likelihood weighted sampling (Fung
and Chang, 1990; Shachter and Peot, 1990). It is based on ancestral sampling
combined with importance sampling. For each variable in turn, if that variable is in the
evidence set, then it is just set to its instantiated value. If it is not in the evidence set,
then it is sampled from the conditional distribution $p(\mathbf{z}_i \mid \mathrm{pa}(i))$ in which the
conditioning variables are set to their currently sampled values. The weighting associated
with the resulting sample $\mathbf{z}$ is then given by

$$r(\mathbf{z}) = \prod_{\mathbf{z}_i \notin E} \frac{p(\mathbf{z}_i \mid \mathrm{pa}(i))}{p(\mathbf{z}_i \mid \mathrm{pa}(i))} \prod_{\mathbf{z}_i \in E} \frac{p(\mathbf{z}_i \mid \mathrm{pa}(i))}{1} = \prod_{\mathbf{z}_i \in E} p(\mathbf{z}_i \mid \mathrm{pa}(i)). \qquad (14.48)$$

This method can be further extended using self-importance sampling (Shachter and
Peot, 1990) in which the importance sampling distribution is continually updated to
reflect the current estimated posterior distribution.
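
A minimal sketch of likelihood weighted sampling, using the weight (14.48), is given below. The three-node binary graph and its probability tables are hypothetical, as is the choice of observing `z3 = 1`. Evidence nodes are clamped to their observed values and contribute a factor to the weight, whereas the remaining nodes are sampled from their conditionals. A posterior probability is then estimated as a weighted average.

```python
import random

# Hypothetical directed graph over binary variables, listed in ancestral order.
# Each table maps a tuple of parent values to P(node = 1 | parents).
GRAPH = [
    ("z1", (), {(): 0.3}),
    ("z2", ("z1",), {(0,): 0.2, (1,): 0.9}),
    ("z3", ("z1", "z2"), {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}),
]


def likelihood_weighted_sample(graph, evidence, rng):
    """Return one sample and its importance weight r(z) from (14.48)."""
    values, weight = {}, 1.0
    for name, parents, table in graph:
        p_one = table[tuple(values[p] for p in parents)]
        if name in evidence:
            # Evidence node: clamp to the observed value and accumulate its
            # conditional probability into the weight.
            values[name] = evidence[name]
            weight *= p_one if evidence[name] == 1 else 1.0 - p_one
        else:
            # Hidden node: sample from its conditional given the current parents.
            values[name] = 1 if rng.random() < p_one else 0
    return values, weight


rng = random.Random(0)
evidence = {"z3": 1}
results = [likelihood_weighted_sample(GRAPH, evidence, rng) for _ in range(20000)]

# Weighted estimate of P(z1 = 1 | z3 = 1); the exact value for these tables is
# 0.2745 / 0.4005, which is approximately 0.685.
total_weight = sum(w for _, w in results)
posterior_z1 = sum(w * v["z1"] for v, w in results) / total_weight
```

Unlike logic sampling, no sample is ever discarded, although samples with very small weights contribute little to the estimate.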


Langevin Sampling 


The Metropolis—Hastings algorithm draws samples from a probability distribution 
by creating a Markov chain of candidate samples using a proposal distribution and 
then accepting or rejecting them using the criterion (14.40). This can be relatively 
inefficient since the proposal distribution is often a simple, fixed distribution that can 
generate updates in any direction in the data space, leading to a random walk. 

We have seen that when training neural networks, it is hugely advantageous to 
make use of the gradient of the log likelihood with respect to the learnable param- 
eters of the model in order to maximize the likelihood function. By analogy, we 
can introduce Markov chain sampling algorithms that make use of the gradient of 
the probability density with respect to the data vector so as to take steps that pref- 
erentially move towards regions of higher probability. One such technique is called 
Hamiltonian Monte Carlo, also known as hybrid Monte Carlo. This again makes 
use of a Metropolis acceptance test (Duane et al., 1987; Bishop, 2006). Here we will 
focus on a different approach that is widely used in deep learning, called Langevin 
sampling. Although it avoids the use of an acceptance test, the algorithm has to be 
designed carefully to ensure that the resulting samples are unbiased. An important 
application of Langevin sampling arises in the context of machine learning models 
defined in terms of energy functions.


14.3.1 Energy-based models 


Many generative models can be expressed as conditional probability distribu- 
tions p(x|w) where x is the data vector and w represents a vector of learnable pa- 
rameters. Such models can be trained by maximizing the corresponding likelihood 
function defined with respect to a training data set. However, to represent a valid 
probability distribution, the model must satisfy 


$$\int p(\mathbf{x} \mid \mathbf{w}) \, \mathrm{d}\mathbf{x} = 1. \qquad (14.49)$$


Ensuring that this requirement is met can significantly limit the allowable forms 
for the model. If we put aside the normalization constraint then we can consider 
a much broader class of models called energy-based models (LeCun et al., 2006). 
Suppose we have a function E(x, w), called the energy function, which is a real- 
valued function of its arguments but which has no other constraints. The exponential 
exp{—E(x,w)} is a non-negative quantity and can therefore be viewed as an un- 
normalized probability distribution over x. Here the introduction of the minus sign 
in the exponent is simply a convention, and it means that higher values of energy cor- 
respond to lower values of probability. We can then define a normalized distribution 
using 


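$$p(\mathbf{x} \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\{-E(\mathbf{x}, \mathbf{w})\} \qquad (14.50)$$

where the normalizing constant $Z(\mathbf{w})$, known as the partition function, is defined by

$$Z(\mathbf{w}) = \int \exp\{-E(\mathbf{x}, \mathbf{w})\} \, \mathrm{d}\mathbf{x}. \qquad (14.51)$$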


The energy function is often modelled using a deep neural network with input vector 
x and a scalar output E(x, w), where w represents the weights and biases in the 
network. 

Note that the partition function depends on $\mathbf{w}$, which creates problems for
training. For example, the log likelihood function for a data set $\mathcal{D} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$ of
i.i.d. data has the form

$$\ln p(\mathcal{D} \mid \mathbf{w}) = -\sum_{n=1}^{N} E(\mathbf{x}_n, \mathbf{w}) - N \ln Z(\mathbf{w}). \qquad (14.52)$$

To compute the gradient of $\ln p(\mathcal{D} \mid \mathbf{w})$ with respect to $\mathbf{w}$, we need to know the form
of $Z(\mathbf{w})$. However, for many choices of the energy function $E(\mathbf{x}, \mathbf{w})$, it will be
impractical to evaluate the partition function in (14.51) because this involves
integrating (or summing for discrete variables) over the whole of $\mathbf{x}$-space. The term
‘energy-based model’ is generally used for models where this integral is intractable.
Note, however, that probabilistic models can be seen as special cases of energy-based
models, and therefore many of the models discussed in this book can be viewed as
energy-based models. The big advantage of energy-based models, therefore, is their
flexibility in that they bypass the requirement for normalization. A corresponding
disadvantage, however, is that since the normalizing constant is unknown, they can
be more difficult to train.


14.3.2 Maximizing the likelihood


Various approximation methods have been developed to train energy-based mod- 
els without having to evaluate the partition function (Song and Kingma, 2021). Here 
we look at techniques based on Markov chain Monte Carlo. An alternative approach, 
called score matching, will be discussed in the context of diffusion models. 

We have seen that for energy-based models, the likelihood function cannot be 
evaluated explicitly due to the unknown partition function Z(w). However, we can 
make use of Monte Carlo sampling methods to approximate the gradient of the log 
likelihood with respect to the model parameters. Once an energy-based model has 
been trained, by whatever means, we also need a way to draw samples from the 
model, and again we can make use of Monte Carlo methods. 

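Using (14.50), the gradient, with respect to the model parameters, of the log
likelihood function for an energy-based model can be written in the form

$$\nabla_{\mathbf{w}} \ln p(\mathbf{x} \mid \mathbf{w}) = -\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w}) - \nabla_{\mathbf{w}} \ln Z(\mathbf{w}). \qquad (14.53)$$

This is the gradient of the log likelihood for a single data point $\mathbf{x}$, but in practice we
want to maximize the likelihood defined over a training set of data points drawn from some
unknown distribution $p_{\mathcal{D}}(\mathbf{x})$. If we assume the data points are i.i.d., then we can
consider the gradient of the expectation of the log likelihood with respect to $p_{\mathcal{D}}(\mathbf{x})$,
which is then given by

$$\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{D}}} [\nabla_{\mathbf{w}} \ln p(\mathbf{x} \mid \mathbf{w})] = -\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{D}}} [\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})] - \nabla_{\mathbf{w}} \ln Z(\mathbf{w}) \qquad (14.54)$$

where we have made use of the fact that the final term $-\nabla_{\mathbf{w}} \ln Z(\mathbf{w})$ does not depend
on $\mathbf{x}$ and can therefore be taken outside the expectation. The partition function $Z(\mathbf{w})$
is assumed to be unknown, but we can make use of (14.51) and rearrange to obtain

$$-\nabla_{\mathbf{w}} \ln Z(\mathbf{w}) = \int \{\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})\} \, p(\mathbf{x} \mid \mathbf{w}) \, \mathrm{d}\mathbf{x}. \qquad (14.55)$$

The right-hand side of (14.55) corresponds to an expectation over the model
distribution $p(\mathbf{x} \mid \mathbf{w})$ given by

$$\int \{\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})\} \, p(\mathbf{x} \mid \mathbf{w}) \, \mathrm{d}\mathbf{x} = \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{M}}} [\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})]. \qquad (14.56)$$

Combining (14.54), (14.55), and (14.56) we obtain

$$\nabla_{\mathbf{w}} \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{D}}} [\ln p(\mathbf{x} \mid \mathbf{w})] = -\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{D}}} [\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})] + \mathbb{E}_{\mathbf{x} \sim p_{\mathcal{M}}(\mathbf{x})} [\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})]. \qquad (14.57)$$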


This result is illustrated in Figure 14.13, and has a nice interpretation, as follows. Our
goal is to find values for the parameters $\mathbf{w}$ that maximize the likelihood function, and
therefore consider a small change to $\mathbf{w}$ in the direction of the gradient $\nabla_{\mathbf{w}} \ln p(\mathbf{x} \mid \mathbf{w})$.
From (14.57) we see that the expected value of this gradient can be expressed as two
terms, having opposite signs. The first term on the right-hand side of (14.57) acts
to decrease $E(\mathbf{x}, \mathbf{w})$, and therefore to increase the probability density defined by
the model, for points $\mathbf{x}$ drawn from $p_{\mathcal{D}}(\mathbf{x})$. The second term on the right-hand
side of (14.57) acts to increase the value of $E(\mathbf{x}, \mathbf{w})$, and therefore to decrease the
probability density defined by the model, for data points drawn from the model itself.
In regions where the model density exceeds the training data density, the net effect
will be to increase the energy and therefore reduce the probability. Conversely, in
regions where training data density exceeds the model density, the net effect will be
to reduce the energy and therefore increase the probability density. Together these
two terms move probability mass away from regions where there is a low density of
training data and towards regions of high data density, as desired. The two terms will
be equal in magnitude when the model distribution matches the data distribution, at
which point the gradient on the left-hand side of (14.57) will equal zero.

Figure 14.13 Illustration of the training of an energy-based model by maximizing the
likelihood, showing the energy function $E(\mathbf{x}, \mathbf{w})$ in green along with the associated
model distribution $p_{\mathcal{M}}(\mathbf{x})$ and the true data distribution $p_{\mathcal{D}}(\mathbf{x})$. Increasing the
expected log likelihood by using (14.57) pushes the energy function up at points
corresponding to samples from the model (shown as blue dots) and pushes it down at
points corresponding to samples from the data set (shown as red dots).
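
The identity (14.57) can be checked numerically in a setting where the partition function happens to be tractable. The sketch below makes several assumptions purely for illustration: a discrete state space of five values, so that the integral in (14.51) becomes a finite sum, a one-parameter energy $E(x, w) = w x^2$, and a small made-up data set. It compares the two-term expression on the right-hand side of (14.57), with the data expectation replaced by an average over the data set, against a finite-difference gradient of the average log likelihood.

```python
import math

STATES = [-2, -1, 0, 1, 2]  # assumed discrete state space for x


def energy(x, w):
    """Hypothetical energy function E(x, w) = w * x^2."""
    return w * x * x


def grad_energy(x, w):
    """Derivative of the energy with respect to w."""
    return x * x


def log_partition(w):
    """ln Z(w), where Z(w) is a finite sum over the discrete states."""
    return math.log(sum(math.exp(-energy(x, w)) for x in STATES))


def model_probs(w):
    """Normalized model distribution p(x | w) from (14.50)."""
    log_z = log_partition(w)
    return [math.exp(-energy(x, w) - log_z) for x in STATES]


def mean_log_likelihood(data, w):
    """Average of ln p(x | w) over the data set."""
    return sum(-energy(x, w) for x in data) / len(data) - log_partition(w)


def gradient_two_terms(data, w):
    """Right-hand side of (14.57): minus the data term plus the model term."""
    data_term = sum(grad_energy(x, w) for x in data) / len(data)
    model_term = sum(p * grad_energy(x, w) for x, p in zip(STATES, model_probs(w)))
    return -data_term + model_term


data = [0, 0, 1, -1, 0, 2, -1, 0, 1, 0]  # made-up observations
w = 0.7
analytic = gradient_two_terms(data, w)

# Central finite difference of the average log likelihood for comparison.
eps = 1e-6
numeric = (mean_log_likelihood(data, w + eps) - mean_log_likelihood(data, w - eps)) / (2 * eps)
```

The two values agree to within the accuracy of the finite difference. In a realistic energy-based model the sum over states is unavailable, which is why the model term has to be estimated from samples.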


14.3.3 Langevin dynamics 


When applying (14.57) as a practical training method, we need to approximate
the two terms on the right-hand side. For any given value of $\mathbf{x}$, we can evaluate
$\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})$ using automatic differentiation. For the first term in (14.57), we can
use the training data set to estimate the expectation over $\mathbf{x}$:

$$\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{D}}} [\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})] \simeq \frac{1}{N} \sum_{n=1}^{N} \nabla_{\mathbf{w}} E(\mathbf{x}_n, \mathbf{w}). \qquad (14.58)$$

The second term is more challenging because we need to draw samples from the
model distribution defined by an energy function whose corresponding partition
function is intractable. This can be done using Markov chain Monte Carlo methods.
One popular approach is called stochastic gradient Langevin dynamics or simply
Langevin sampling (Parisi, 1981; Welling and Teh, 2011). This approach depends on
the distribution $p(\mathbf{x} \mid \mathbf{w})$ only through the score function, which is defined to be the
gradient of the log likelihood with respect to the data vector $\mathbf{x}$, and is given by

$$\mathbf{s}(\mathbf{x}, \mathbf{w}) = \nabla_{\mathbf{x}} \ln p(\mathbf{x} \mid \mathbf{w}). \qquad (14.59)$$

It is worth emphasising that this gradient is taken with respect to the data point $\mathbf{x}$ and
is therefore not the usual gradient with respect to the learnable parameters $\mathbf{w}$. If we
substitute (14.50) into (14.59) we obtain

$$\mathbf{s}(\mathbf{x}, \mathbf{w}) = -\nabla_{\mathbf{x}} E(\mathbf{x}, \mathbf{w}) \qquad (14.60)$$

where we see that the partition function no longer appears as it is independent of $\mathbf{x}$.
We start by drawing an initial value $\mathbf{x}^{(0)}$ from a prior distribution, and then we
iterate the following Markov chain steps:

$$\mathbf{x}^{(\tau+1)} = \mathbf{x}^{(\tau)} + \eta \nabla_{\mathbf{x}} \ln p(\mathbf{x}^{(\tau)}, \mathbf{w}) + \sqrt{2\eta} \, \boldsymbol{\epsilon}^{(\tau)}, \qquad \tau \in 1, \ldots, T \qquad (14.61)$$

where $\boldsymbol{\epsilon}^{(\tau)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ are independent samples from a zero-mean, unit-covariance
Gaussian distribution, and the parameter $\eta$ controls the step size. Each iteration of the
Langevin equation takes a step in the direction of the gradient of the log likelihood,
and then adds Gaussian noise. It can be shown that, in the limits of $\eta \to 0$ and
$T \to \infty$, the value of $\mathbf{x}^{(T)}$ is an independent sample from the distribution $p(\mathbf{x})$.
Langevin sampling is summarized in Algorithm 14.4.
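
The update (14.61) can be sketched for a one-dimensional example. The energy $E(x) = x^2/2$ is an assumption made for illustration; by (14.60) its score is $-x$, and the corresponding normalized distribution is a zero-mean, unit-variance Gaussian, so the output can be checked. The uniform initial distribution, the step size, and the number of steps are likewise illustrative choices. Each sample comes from a separate chain so that the samples are independent of one another.

```python
import math
import random


def langevin_sample(score, x0, eta, num_steps, rng):
    """Run the Langevin update (14.61) for num_steps and return the final value."""
    x = x0
    for _ in range(num_steps):
        eps = rng.gauss(0.0, 1.0)  # zero-mean, unit-variance Gaussian noise
        x = x + eta * score(x) + math.sqrt(2.0 * eta) * eps
    return x


def score(x):
    """Score for the assumed energy E(x) = x^2 / 2, namely -dE/dx = -x."""
    return -x


rng = random.Random(0)
# One independent chain per sample, each started from a broad uniform prior.
samples = [
    langevin_sample(score, rng.uniform(-3.0, 3.0), eta=0.01, num_steps=500, rng=rng)
    for _ in range(2000)
]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With a finite step size the chain samples from a slightly perturbed distribution, so `var` is expected to be close to, but not exactly, one. No acceptance test is applied, in keeping with the description above.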

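We can repeat the process to generate a set of samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$ and then
approximate the second term in (14.57) using

$$\mathbb{E}_{\mathbf{x} \sim p_{\mathcal{M}}(\mathbf{x})} [\nabla_{\mathbf{w}} E(\mathbf{x}, \mathbf{w})] \simeq \frac{1}{M} \sum_{m=1}^{M} \nabla_{\mathbf{w}} E(\mathbf{x}_m, \mathbf{w}). \qquad (14.62)$$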


Running long Markov chains to generate independent samples can be compu- 
tationally expensive, and so we need to consider practical approximations. One ap- 
proach is called contrastive divergence (Hinton, 2002). Here the samples used to 
evaluate (14.62) are obtained by running a Monte Carlo chain starting with one of 
the training data points $\mathbf{x}_n$. If the chain is run for a large number of steps, then the
resulting value will be essentially an unbiased sample from the model distribution. 
Instead Hinton (2002) proposes running for only a few steps of Monte Carlo, perhaps 
even as few as one step, which is computationally much less costly. The resulting 
sample will be far from unbiased and will lie close to the data manifold. As a result, 
the effect of using gradient descent will be to shape the energy surface, and hence 
the probability density, only in the neighbourhood of the data manifold. This can 
prove effective for tasks such as discrimination but is expected to be less effective in 
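learning a generative model.

Algorithm 14.4: Langevin sampling

Input: Initial value $\mathbf{x}^{(0)}$
       Probability density $p(\mathbf{x}, \mathbf{w})$
       Learning rate parameter $\eta$
       Number of iterations $T$
Output: Final value $\mathbf{x}^{(T)}$

$\mathbf{x} \leftarrow \mathbf{x}^{(0)}$
for $\tau \in \{1, \ldots, T\}$ do
    $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon} \mid \mathbf{0}, \mathbf{I})$
    $\mathbf{x} \leftarrow \mathbf{x} + \eta \nabla_{\mathbf{x}} \ln p(\mathbf{x}, \mathbf{w}) + \sqrt{2\eta} \, \boldsymbol{\epsilon}$
end for
return $\mathbf{x}$ // Final value $\mathbf{x}^{(T)}$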


Exercises

14.1 (⋆) Show that f defined by (14.2) is an unbiased estimator, in other words that the
expectation of the right-hand side is equal to $\mathbb{E}[f(\mathbf{z})]$.

14.2 (⋆) Show that f defined by (14.2) has variance given by (14.4).

14.3 (⋆) Suppose that $z$ is a random variable with uniform distribution over $(0, 1)$ and that
we transform $z$ using $y = h^{-1}(z)$ where $h(y)$ is given by (14.6). Show that $y$ has
the distribution $p(y)$.

14.4 (⋆⋆) Given a random variable $z$ that is uniformly distributed over $(0, 1)$, find a
transformation $y = f(z)$ such that $y$ has a Cauchy distribution given by (14.8).

14.5 (⋆⋆) Suppose that $z_1$ and $z_2$ are uniformly distributed over the unit circle, as shown in
Figure 14.3, and that we make the change of variables given by (14.10) and (14.11).
Show that $(y_1, y_2)$ will be distributed according to (14.12).

14.6 (⋆⋆) Let $\mathbf{z}$ be a $D$-dimensional random variable having a Gaussian distribution with
zero mean and unit covariance matrix, and suppose that the positive definite
symmetric matrix $\boldsymbol{\Sigma}$ has the Cholesky decomposition $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\mathrm{T}}$, where $\mathbf{L}$ is a
lower-triangular matrix (i.e., one with zeros above the leading diagonal). Show that the
variable $\mathbf{y} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$ has a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
This provides a technique for generating samples from a general multivariate
Gaussian using samples from a univariate Gaussian having zero mean and unit variance.

14.7 (⋆⋆) In this exercise, we show more carefully that rejection sampling does indeed
draw samples from the desired distribution $p(\mathbf{z})$. Suppose the proposal distribution
is $q(\mathbf{z})$. Show that the probability of a sample value $\mathbf{z}$ being accepted is given by
$\tilde{p}(\mathbf{z})/kq(\mathbf{z})$ where $\tilde{p}$ is any unnormalized distribution that is proportional to $p(\mathbf{z})$,
and the constant $k$ is set to the smallest value that ensures $kq(\mathbf{z}) \geqslant \tilde{p}(\mathbf{z})$ for all
values of $\mathbf{z}$. Note that the probability of drawing a value $\mathbf{z}$ is given by the probability
of drawing that value from $q(\mathbf{z})$ times the probability of accepting that value given
that it has been drawn. Make use of this, along with the sum and product rules of
probability, to write down the normalized form for the distribution over $\mathbf{z}$, and show
that it equals $p(\mathbf{z})$.


14.8 (⋆) Suppose that $z$ has a uniform distribution over the interval $[0, 1]$. Show that the
variable $y = b \tan z + c$ has a Cauchy distribution given by (14.16).

14.9 (⋆⋆) Determine expressions for the coefficients $k_i$ in the envelope distribution (14.17)
for adaptive rejection sampling using the requirements of continuity and
normalization.

14.10 (⋆⋆) By making use of the technique discussed in Section 14.1.2 for sampling from a
single exponential distribution, devise an algorithm for sampling from the piecewise
exponential distribution defined by (14.17).

14.11 (⋆) Show that the simple random walk over the integers defined by (14.28), (14.29),
and (14.30) has the property that $\mathbb{E}[(z^{(\tau)})^2] = \mathbb{E}[(z^{(\tau-1)})^2] + 1/2$ and hence by
induction that $\mathbb{E}[(z^{(\tau)})^2] = \tau/2$.

14.12 (⋆⋆) Show that the Gibbs sampling algorithm, discussed in Section 14.2.4, satisfies
detailed balance as defined by (14.34).

14.13 (⋆) Consider the distribution shown in Figure 14.14. Discuss whether the standard
Gibbs sampling procedure for this distribution is ergodic and therefore whether it
would sample correctly from this distribution.

Figure 14.14 A probability distribution over two variables $z_1$ and $z_2$ that is uniform
over the shaded regions and zero everywhere else.

14.14 (⋆) Verify that the over-relaxation update (14.46), in which $z_i$ has mean $\mu_i$ and
variance $\sigma_i^2$ and where $\nu$ has zero mean and unit variance, gives a value $z_i'$ with mean $\mu_i$
and variance $\sigma_i^2$.

14.15 (⋆) Show that in likelihood weighted sampling from a directed graph the importance
sampling weights are given by (14.48).

14.16 (⋆) Show that the distribution (14.50) is normalized with respect to $\mathbf{x}$ provided $Z(\mathbf{w})$
satisfies (14.51).

14.17 (⋆⋆) By making use of (14.50) show that the gradient of the log likelihood function
for an energy-based model can be written in the form (14.52).

14.18 (⋆⋆) By making use of (14.54), (14.55), and (14.56), show that the gradient of the
log likelihood function for an energy-based model can be written in the form (14.57).


15 Discrete Latent Variables


We have seen how complex distributions can be constructed by combining multi- 
ple simple distributions and how the resulting models can be described by directed 
graphs. In addition to the observed variables, which form part of the data set, such 
models often introduce additional hidden, or latent, variables. These might corre- 
spond to specific quantities involved in the data generation process, such as the un- 
known orientation of an object in three-dimensional space in the case of images, or 
they may be introduced simply as modelling constructs to allow much richer models 
to be created. If we define a joint distribution over observed and latent variables, the 
corresponding distribution of the observed variables alone is obtained by marginal- 
ization. This allows relatively complex marginal distributions over observed vari- 
ables to be expressed in terms of more tractable joint distributions over the expanded 
space of observed and latent variables. 

In this chapter, we will see that marginalizing over discrete latent variables gives
rise to mixture distributions. Our focus will be on mixtures of Gaussians that provide
a good illustration of mixture distributions and that are also widely used in
machine learning. One simple application for mixture models is to discover clusters 
in data, and we begin our discussion by considering a technique for clustering called 
the K-means algorithm, which corresponds to a particular non-probabilistic limit of 
Gaussian mixtures. Then we introduce the latent-variable view of mixture distribu- 
tions in which the discrete latent variables can be interpreted as defining assignments 
of data points to specific components of the mixture. 

A general technique for finding maximum likelihood estimators in latent-variable 
models is the expectation—maximization (EM) algorithm. We first use the Gaussian 
mixture distribution to motivate the EM algorithm in an informal way, and then we 
give a more careful treatment based on the latent-variable viewpoint. Finally we pro- 
vide a general perspective by introducing the evidence lower bound (ELBO), which 
will play an important role in generative models such as variational autoencoders 
and diffusion models. 


K-means Clustering


We begin by considering the problem of identifying groups, or clusters, of data points
in a multi-dimensional space. Suppose we have a data set $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ consisting of
$N$ observations of a $D$-dimensional Euclidean variable $\mathbf{x}$. Our goal is to partition the
data set into some number $K$ of clusters, where we will suppose for the moment that
the value of $K$ is given. Intuitively, we might think of a cluster as comprising a group
of data points whose inter-point distances are small compared with the distances to
points outside the cluster. We can formalize this notion by first introducing a set
of $D$-dimensional vectors $\boldsymbol{\mu}_k$, where $k = 1, \ldots, K$, in which $\boldsymbol{\mu}_k$ is a ‘prototype’
associated with the $k$th cluster. As we will see shortly, we can think of the $\boldsymbol{\mu}_k$ as
representing the centres of the clusters. Our goal is then to find a set of cluster
vectors $\{\boldsymbol{\mu}_k\}$, along with an assignment of data points to clusters, such that the sum
of the squares of the distances of each data point to its closest cluster vector $\boldsymbol{\mu}_k$ is a
minimum.

It is convenient at this point to define some notation to describe the assignment
of data points to clusters. For each data point $\mathbf{x}_n$, we introduce a corresponding set
of binary indicator variables $r_{nk} \in \{0, 1\}$, where $k = 1, \ldots, K$. These indicators
describe which of the $K$ clusters the data point $\mathbf{x}_n$ is assigned to, so that if data point
$\mathbf{x}_n$ is assigned to cluster $k$ then $r_{nk} = 1$, and $r_{nj} = 0$ for $j \neq k$. This is an example
of the 1-of-$K$ coding scheme. We can then define an error function:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2, \qquad (15.1)$$

which represents the sum of the squares of the distances of each data point to its
assigned vector $\boldsymbol{\mu}_k$. Our goal is to find values for the $\{r_{nk}\}$ and the $\{\boldsymbol{\mu}_k\}$ so as to
minimize $J$. We can do this through an iterative procedure in which each iteration
involves two successive steps corresponding to successive optimizations with respect
to the $\{r_{nk}\}$ and the $\{\boldsymbol{\mu}_k\}$. First we choose some initial values for the $\{\boldsymbol{\mu}_k\}$. Then
in the first step, we minimize $J$ with respect to the $\{r_{nk}\}$, keeping the $\{\boldsymbol{\mu}_k\}$ fixed.
In the second step, we minimize $J$ with respect to the $\{\boldsymbol{\mu}_k\}$, keeping $\{r_{nk}\}$ fixed.
This two-step optimization is then repeated until convergence. We will see that these
two stages of updating $\{r_{nk}\}$ and updating $\{\boldsymbol{\mu}_k\}$ correspond, respectively, to the
E (expectation) and M (maximization) steps of the EM algorithm, and to emphasize
this, we will use the terms E step and M step in the context of the K-means algorithm.

Consider first the determination of the $\{r_{nk}\}$ with the $\{\boldsymbol{\mu}_k\}$ held fixed (the E
step). Because $J$ in (15.1) is a linear function of the $\{r_{nk}\}$, this optimization can be
performed easily to give a closed-form solution. The terms involving different $n$ are
independent, and so we can optimize for each $n$ separately by choosing $r_{nk}$ to be 1
for whichever value of $k$ gives the minimum value of $\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$. In other words,
we simply assign the $n$th data point to the closest cluster centre. More formally, this
can be expressed as

$$ r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2, \\ 0, & \text{otherwise.} \end{cases} \tag{15.2} $$
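The E step (15.2) and the error function (15.1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation; the function names `assign_clusters` and `distortion` and the toy data are assumptions introduced here.

```python
import numpy as np

def assign_clusters(X, mu):
    """E step (15.2): one-hot assignments r[n, k] for data X (N, D) and centres mu (K, D)."""
    # Squared Euclidean distance from every point to every centre, shape (N, K).
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dist.argmin(axis=1)
    r = np.zeros_like(sq_dist)
    r[np.arange(X.shape[0]), nearest] = 1.0
    return r

def distortion(X, mu, r):
    """Error function J of (15.1): sum of squared distances to the assigned centres."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dist).sum())

# Toy data (hypothetical): two points near the origin and one near (5, 5).
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
r = assign_clusters(X, mu)
J = distortion(X, mu, r)
print(r)
print(J)
```

Here the first two points are assigned to the first centre and the third point to the second, so the three squared distances 0, 1 and 2 sum to J = 3.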


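Now consider the optimization of the $\{\boldsymbol{\mu}_k\}$ with the $\{r_{nk}\}$ held fixed (the M
step). The objective function $J$ is a quadratic function of $\boldsymbol{\mu}_k$, and it can be minimized
by setting its derivative with respect to $\boldsymbol{\mu}_k$ to zero giving

$$ 2 \sum_{n=1}^{N} r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0, \tag{15.3} $$

which we can easily solve for $\boldsymbol{\mu}_k$ to give

$$ \boldsymbol{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}}. \tag{15.4} $$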


The denominator in this expression is equal to the number of points assigned to
cluster $k$, and so this result has a simple interpretation, namely that $\boldsymbol{\mu}_k$ is equal
to the mean of all the data points $\mathbf{x}_n$ assigned to cluster $k$. For this reason, the
procedure is known as the K-means algorithm (Lloyd, 1982). It is summarized in
Algorithm 15.1. Because the assignments $\{r_{nk}\}$ are discrete, so that only finitely many
configurations exist, and each iteration will not lead to an increase in the error
function, the K-means algorithm is guaranteed to converge in a finite number of steps.

The two phases of reassigning data points to clusters and recomputing the cluster 
means are repeated in turn until there is no further change in the assignments (or until 
some maximum number of iterations is exceeded). However, this approach may 
converge to a local rather than a global minimum of J. The convergence properties 
of the K-means algorithm were studied by MacQueen (1967).
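The alternation of E and M steps described above can be sketched as follows. This is a hedged NumPy illustration, not the book's code: the function name `kmeans`, the `max_iter` cap, the synthetic data and the choice to leave the prototype of an empty cluster unchanged are all assumptions made here.

```python
import numpy as np

def kmeans(X, mu_init, max_iter=100):
    """Batch K-means in the spirit of Algorithm 15.1. X: (N, D); mu_init: (K, D)."""
    mu = np.array(mu_init, dtype=float)
    labels = np.full(X.shape[0], -1)
    for _ in range(max_iter):
        # E step: assign each point to its nearest prototype, as in (15.2).
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # M step: move each prototype to the mean of its assigned points, as in (15.4).
        for k in range(mu.shape[0]):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old prototype
                mu[k] = members.mean(axis=0)
    return mu, labels

# Synthetic data (hypothetical): two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, size=(50, 2)), rng.normal(2.0, 0.3, size=(50, 2))])
mu, labels = kmeans(X, mu_init=[[-1.0, 0.0], [1.0, 0.0]])
print(mu)
```

Because the result depends on the initial prototypes, a common practice is to run such a routine from several initializations and keep the solution with the smallest value of J.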

Algorithm 15.1: K-means algorithm

Input: Initial prototype vectors $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$
       Data set $\mathbf{x}_1, \ldots, \mathbf{x}_N$
Output: Final prototype vectors $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$

$\{r_{nk}\} \leftarrow 0$ // Initially set all assignments to zero
repeat
    $\{r_{nk}^{(\mathrm{old})}\} \leftarrow \{r_{nk}\}$
    // Update assignments
    for $n \in \{1, \ldots, N\}$ do
        $k \leftarrow \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2$
        $r_{nk} \leftarrow 1$
        $r_{nj} \leftarrow 0, \quad j \in \{1, \ldots, K\}, \; j \neq k$
    end for
    // Update prototype vectors
    for $k \in \{1, \ldots, K\}$ do
        $\boldsymbol{\mu}_k \leftarrow \sum_n r_{nk} \mathbf{x}_n / \sum_n r_{nk}$
    end for
until $\{r_{nk}\} = \{r_{nk}^{(\mathrm{old})}\}$ // Assignments unchanged
return $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$, $\{r_{nk}\}$

The K-means algorithm is illustrated in Figure 15.1 using data derived from
eruptions of the Old Faithful geyser in Yellowstone National Park. The data set
consists of 272 data points, each of which gives the duration of an eruption on the
horizontal axis and the time to the next eruption on the vertical axis. Here we have
made a linear re-scaling of the data, known as standardizing, such that each of the
variables has zero mean and unit standard deviation.

For this example, we have chosen K = 2 and so the assignment of each data 
point to the nearest cluster centre is equivalent to a classification of the data points 
according to which side they lie of the perpendicular bisector of the two cluster 
centres. A plot of the cost function J given by (15.1) for the Old Faithful example is 
shown in Figure 15.2. Note that we have deliberately chosen poor initial values for 
the cluster centres so that the algorithm takes several steps before convergence. In 
practice, a better initialization procedure would be to choose the cluster centres $\boldsymbol{\mu}_k$ to
be equal to a random subset of $K$ data points. Also note that the K-means algorithm
is often used to initialize the parameters in a Gaussian mixture model before applying 
the EM algorithm. 

Figure 15.1 Illustration of the K-means algorithm using the re-scaled Old Faithful data set. (a) Green points
denote the data set in a two-dimensional Euclidean space. The initial choices for centres $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are shown
by the red and blue crosses, respectively. (b) In the initial E step, each data point is assigned either to the red
cluster or to the blue cluster, according to which cluster centre is nearer. This is equivalent to classifying the
points according to which side of the perpendicular bisector of the two cluster centres, shown by the magenta
line, they lie. (c) In the subsequent M step, each cluster centre is recomputed to be the mean of the points
assigned to the corresponding cluster. (d)–(i) show successive E and M steps through to final convergence of
the algorithm.

Figure 15.2 Plot of the cost function $J$ given by (15.1) after each E step (blue points) and M step (red points)
of the K-means algorithm for the example shown in Figure 15.1. The algorithm has converged after the third
M step, and the final EM cycle produces no changes in either the assignments or the prototype vectors.

So far, we have considered a batch version of K-means in which the whole data
set is used together to update the prototype vectors. We can also derive a sequential
update in which, for each data point $\mathbf{x}_n$ in turn, we update the nearest prototype $\boldsymbol{\mu}_k$
using

$$ \boldsymbol{\mu}_k^{\mathrm{new}} = \boldsymbol{\mu}_k^{\mathrm{old}} + \frac{1}{N_k} \left( \mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{old}} \right) \tag{15.5} $$

where $N_k$ is the number of data points that have so far been used to update $\boldsymbol{\mu}_k$. This
allows each data point to be used once and then discarded before seeing the next data
point.
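The sequential update (15.5) can be sketched as below. This is an illustrative sketch only: the function name `sequential_kmeans_update` is hypothetical, and it assumes that $N_k$ counts the current point as well, so that each prototype becomes the running mean of the points it has absorbed.

```python
import numpy as np

def sequential_kmeans_update(mu, counts, x):
    """Apply the sequential update (15.5) in place for a single data point x."""
    k = int(((mu - x) ** 2).sum(axis=1).argmin())  # index of the nearest prototype
    counts[k] += 1                                  # assumption: N_k includes the current point
    mu[k] += (x - mu[k]) / counts[k]                # move the prototype towards x by 1 / N_k
    return k

# Hypothetical stream of three points and two initial prototypes.
mu = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.zeros(2, dtype=int)
for x in np.array([[1.0, 1.0], [3.0, 1.0], [9.0, 9.0]]):
    sequential_kmeans_update(mu, counts, x)
print(mu)
print(counts)
```

With the counts started at zero, the first point absorbed by a prototype replaces its initial value, and after the three points above the prototypes are (2, 1) and (9, 9).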

One notable feature of the K-means algorithm is that at each iteration, every 
data point is assigned to one, and only one, of the clusters. Although some data 
points will be much closer to a particular centre $\boldsymbol{\mu}_k$ than to any other centre, there
may be other data points that lie roughly midway between cluster centres. In the 
latter case, it is not clear that the hard assignment to the nearest cluster is the most 
appropriate. We will see that by adopting a probabilistic approach, we obtain ‘soft’ 
assignments of data points to clusters in a way that reflects the level of uncertainty 
over the most appropriate assignment. This probabilistic formulation has numerous 
benefits. 


15.1.1 Image segmentation 


As an illustration of the application of the K-means algorithm, we consider
the related problems of image segmentation and image compression. The goal of
segmentation is to partition an image into regions that each have a reasonably
homogeneous visual appearance or that correspond to objects or parts of objects
(Forsyth and Ponce, 2003). Each pixel in an image is a point in a three-dimensional
space comprising the intensities of the red, blue, and green channels, and our
segmentation algorithm simply treats each pixel in the image as a separate data
point. Note that strictly this space is not Euclidean because the channel intensities
are bounded by the interval [0, 1]. Nevertheless, we can apply the K-means algorithm
without difficulty. We illustrate the result of running K-means to convergence, for
any particular value of $K$, by redrawing the image with each pixel vector replaced by
the {R, G, B} intensity triplet given by the centre $\boldsymbol{\mu}_k$ to which that pixel has been
assigned. Results for various values of $K$ are shown in Figure 15.3. We see that for a
given value of $K$, the algorithm represents the image using a palette of only $K$
colours. It should be emphasized that this use of K-means is not a particularly
sophisticated approach to image segmentation, not least because it takes no account
of the spatial proximity of different pixels. Image segmentation is in general
extremely difficult and remains the subject of active research. It is introduced here
simply to illustrate the behaviour of the K-means algorithm.

Figure 15.3 An example of the application of the K-means clustering algorithm to image segmentation showing
an initial image together with its K-means segmentations obtained using various values of $K$. This also
illustrates the use of vector quantization for data compression, in which smaller values of $K$ give higher
compression at the expense of poorer image quality.

We can also use a clustering algorithm to perform data compression. It is im- 
portant to distinguish between lossless data compression, in which the goal is to 
be able to reconstruct the original data exactly from the compressed representation, 
and lossy data compression, in which we accept some errors in the reconstruction 
in return for higher levels of compression than can be achieved in the lossless case. 
We can apply the K-means algorithm to the problem of lossy data compression as 
follows. For each of the N data points, we store only the identity k of the cluster to 
which it is assigned. We also store the values of the $K$ cluster centres $\{\boldsymbol{\mu}_k\}$, which
typically requires significantly less data, provided we choose $K \ll N$. Each data
point is then approximated by its nearest centre $\boldsymbol{\mu}_k$. New data points can similarly
be compressed by first finding the nearest $\boldsymbol{\mu}_k$ and then storing the label $k$ instead of
the original data vector. This framework is often called vector quantization, and the
vectors $\{\boldsymbol{\mu}_k\}$ are called codebook vectors.
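The encode and decode steps of vector quantization described above can be sketched as follows. The names `vq_encode` and `vq_decode`, the two-entry codebook and the three pixel values are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def vq_encode(X, codebook):
    """Replace each row of X (N, D) by the index k of its nearest codebook vector."""
    sq_dist = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def vq_decode(codes, codebook):
    """Lossy reconstruction: look up the codebook vector for each stored index."""
    return codebook[codes]

# Hypothetical codebook with K = 2 entries (black and white) and three RGB pixels.
codebook = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
pixels = np.array([[0.1, 0.0, 0.2], [0.9, 0.8, 1.0], [0.4, 0.4, 0.4]])
codes = vq_encode(pixels, codebook)
approx = vq_decode(codes, codebook)
print(codes)
print(approx)
```

Only `codes` and `codebook` need to be stored or transmitted, and the reconstruction error is the distance from each pixel to its codebook vector.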

The image segmentation problem discussed above also provides an illustration
of the use of clustering for data compression. Suppose the original image has $N$
pixels comprising {R, G, B} values, each of which is stored with 8 bits of precision.
Directly transmitting the whole image would cost $24N$ bits. Now suppose we first
run K-means on the image data, and then instead of transmitting the original pixel
intensity vectors, we transmit the identity of the nearest vector $\boldsymbol{\mu}_k$. Because there
are $K$ such vectors, this requires $\log_2 K$ bits per pixel. We must also transmit the
$K$ codebook vectors $\{\boldsymbol{\mu}_k\}$, which requires $24K$ bits, and so the total number of
bits required to transmit the image is $24K + N \log_2 K$ (with $\log_2 K$ rounded up to the
nearest integer). The original image shown in Figure 15.3 has 240 × 180 = 43,200 pixels
and so requires 24 × 43,200 = 1,036,800 bits to transmit directly. By comparison,
the compressed images require 43,248 bits ($K = 2$), 86,472 bits ($K = 3$), and
173,040 bits ($K = 10$), respectively, to transmit. These represent compression ratios
compared to the original image of 4.2%, 8.3%, and 16.7%, respectively. We see that
there is a trade-off between the degree of compression and image quality. Note that
our aim in this example is to illustrate the K-means algorithm. If we had been aiming
to produce a good image compressor, then it would be more fruitful to consider small
blocks of adjacent pixels, for instance 5 × 5, and thereby exploit the correlations that
exist in natural images between nearby pixels.
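The bit counts quoted above can be reproduced with a short calculation. The sketch below assumes, as the quoted figures imply, that the number of index bits per pixel is $\log_2 K$ rounded up to a whole number; the function name `kmeans_image_bits` is hypothetical.

```python
import math

def kmeans_image_bits(n_pixels, k, bits_per_channel=8, channels=3):
    """Bits needed to send K codebook vectors plus one cluster index per pixel."""
    codebook_bits = channels * bits_per_channel * k  # 24K for 8-bit RGB
    index_bits = math.ceil(math.log2(k))             # whole bits per pixel index
    return codebook_bits + n_pixels * index_bits

n_pixels = 240 * 180
raw_bits = 24 * n_pixels
for k in (2, 3, 10):
    bits = kmeans_image_bits(n_pixels, k)
    print(k, bits, f"{100 * bits / raw_bits:.1f}%")
```

Running this gives 43,248, 86,472 and 173,040 bits for K = 2, 3 and 10, which are 4.2%, 8.3% and 16.7% of the 1,036,800 bits of the original image.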


15.2 Mixtures of Gaussians


We have previously motivated the Gaussian mixture model as a simple linear super- 
position of Gaussian components, aimed at providing a richer class of density mod- 
els than a single Gaussian. We now turn to a formulation of Gaussian mixtures in 
terms of discrete latent variables. This will provide us with a deeper insight into this 
important distribution and will also serve to motivate the expectation–maximization
algorithm. 

Recall from (3.111) that the Gaussian mixture distribution can be written as a
linear superposition of Gaussians in the form

$$ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \tag{15.6} $$


Let us introduce a $K$-dimensional binary random variable $\mathbf{z}$ having a 1-of-$K$
representation in which one of the elements is equal to 1 and all other elements are equal
to 0. The values of $z_k$ therefore satisfy $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$, and we see that
there are $K$ possible states for the vector $\mathbf{z}$ according to which element is non-zero.
We will define the joint distribution $p(\mathbf{x}, \mathbf{z})$ in terms of a marginal distribution $p(\mathbf{z})$
and a conditional distribution $p(\mathbf{x} \mid \mathbf{z})$. The marginal distribution over $\mathbf{z}$ is specified
in terms of the mixing coefficients $\pi_k$, such that

$$ p(z_k = 1) = \pi_k $$

where the parameters $\{\pi_k\}$ must satisfy

$$ 0 \leqslant \pi_k \leqslant 1 \tag{15.7} $$

together with

$$ \sum_{k=1}^{K} \pi_k = 1 \tag{15.8} $$

if they are to be valid probabilities. Because $\mathbf{z}$ uses a 1-of-$K$ representation, we can
also write this distribution in the form

$$ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}. \tag{15.9} $$

Figure 15.4 Graphical representation of a mixture model, in which the joint distribution is
expressed in the form $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})$.

Exercise 15.3


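Similarly, the conditional distribution of $\mathbf{x}$ given a particular value for $\mathbf{z}$ is a
Gaussian:

$$ p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), $$

which can also be written in the form

$$ p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}. \tag{15.10} $$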


The joint distribution is given by $p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})$ and is described by the graphical model
in Figure 15.4. The marginal distribution of $\mathbf{x}$ is then obtained by summing the joint
distribution over all possible states of $\mathbf{z}$ to give

$$ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{15.11} $$

where we have made use of (15.9) and (15.10). Thus, the marginal distribution of $\mathbf{x}$ is
a Gaussian mixture of the form (15.6). If we have several observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$,
then, because we have represented the marginal distribution in the form
$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})$, it follows that for every observed data point $\mathbf{x}_n$ there is a
corresponding latent variable $\mathbf{z}_n$.

We have therefore found an equivalent formulation of the Gaussian mixture in- 
volving explicit latent variables. It might seem that we have not gained much by do- 
ing so. However, we are now able to work with the joint distribution p(x, z) instead 
of the marginal distribution p(x), and this will lead to significant simplifications, 
most notably through the introduction of the EM algorithm. 

Another quantity that will play an important role is the conditional probability
of $\mathbf{z}$ given $\mathbf{x}$. We will use $\gamma(z_k)$ to denote $p(z_k = 1 \mid \mathbf{x})$, whose value can be found
using Bayes’ theorem:

$$ \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \frac{p(z_k = 1) \, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1) \, p(\mathbf{x} \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \tag{15.12} $$

We will view $\pi_k$ as the prior probability of $z_k = 1$, and the quantity $\gamma(z_k)$ as the
corresponding posterior probability once we have observed $\mathbf{x}$. As we will see later,
$\gamma(z_k)$ can also be viewed as the responsibility that component $k$ takes for ‘explaining’
the observation $\mathbf{x}$.
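The responsibilities (15.12) can be evaluated numerically as in the sketch below. It is an illustration under stated assumptions rather than the book's implementation: the helper names `log_gaussian` and `responsibilities` and the example parameters are hypothetical, and the normalization is carried out in log space, which is one common way to avoid numerical underflow.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(x | mean, cov) for each row of X (N, D)."""
    D = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis distance (x - mean)^T cov^{-1} (x - mean) for every row.
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def responsibilities(X, pi, mu, Sigma):
    """Posterior probabilities gamma[n, k] of (15.12)."""
    log_joint = np.stack(
        [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(len(pi))],
        axis=1,
    )
    # Subtract the row maximum before exponentiating so that no row underflows to all zeros.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hypothetical two-component mixture with equal mixing coefficients.
pi = np.array([0.5, 0.5])
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
X = np.array([[0.0, 0.0], [3.0, 0.0]])
gamma = responsibilities(X, pi, mu, Sigma)
print(gamma)
```

The first point lies midway between the two means and so receives responsibility 0.5 from each component, whereas the second point is explained almost entirely by the second component.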

We can use ancestral sampling to generate random samples distributed according
to the Gaussian mixture model. To do this, we first generate a value for $\mathbf{z}$, which we
denote $\widehat{\mathbf{z}}$, from the marginal distribution $p(\mathbf{z})$ and then generate a value for $\mathbf{x}$ from
the conditional distribution $p(\mathbf{x} \mid \widehat{\mathbf{z}})$. We can depict samples from the joint distribution
$p(\mathbf{x}, \mathbf{z})$ by plotting points at the corresponding values of $\mathbf{x}$ and then colouring them
according to the value of $\mathbf{z}$, in other words according to which Gaussian component
was responsible for generating them, as shown in Figure 15.5(a). Similarly, samples
from the marginal distribution $p(\mathbf{x})$ are obtained by taking the samples from the joint
distribution and ignoring the values of $\mathbf{z}$. These are illustrated in Figure 15.5(b) by
plotting the $\mathbf{x}$ values without any coloured labels.
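Ancestral sampling from a Gaussian mixture can be sketched as follows. The function name `sample_gmm` and the three-component parameters are hypothetical; in particular, they are not the parameters of the mixture shown in Figure 3.8.

```python
import numpy as np

def sample_gmm(n_samples, pi, mu, Sigma, rng):
    """Ancestral sampling: draw z from p(z), then x from p(x | z)."""
    K, D = mu.shape
    # z stores the index of the active component, i.e. the position of the 1 in the 1-of-K vector.
    z = rng.choice(K, size=n_samples, p=pi)
    X = np.empty((n_samples, D))
    for k in range(K):
        idx = np.flatnonzero(z == k)
        X[idx] = rng.multivariate_normal(mu[k], Sigma[k], size=idx.size)
    return X, z

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
Sigma = np.array([np.eye(2) * 0.5] * 3)
X, z = sample_gmm(500, pi, mu, Sigma, rng)
print(X.shape, np.bincount(z, minlength=3))
```

Keeping both `X` and `z` corresponds to samples from the joint distribution, whereas discarding `z` leaves samples from the marginal distribution.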

We can also use this synthetic data set to illustrate the ‘responsibilities’ by
evaluating, for every data point, the posterior probability for each component in the
mixture distribution from which this data set was generated. In particular, we can
represent the value of the responsibilities $\gamma(z_{nk})$ associated with data point $\mathbf{x}_n$ by
plotting the corresponding point using proportions of red, blue, and green ink given
by $\gamma(z_{nk})$ for $k = 1, 2, 3$, respectively, as shown in Figure 15.5(c). So, for instance,
a data point for which $\gamma(z_{n1}) = 1$ will be coloured red, whereas one for which
$\gamma(z_{n2}) = \gamma(z_{n3}) = 0.5$ will be coloured with equal proportions of blue and green
ink and so will appear cyan. This should be compared with Figure 15.5(a) in which
the data points were labelled using the true identity of the component from which
they were generated.


15.2.1 Likelihood function 


Suppose we have a data set of observations $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, and we wish to model
this data using a mixture of Gaussians. We can represent this data set as an $N \times D$
matrix $\mathbf{X}$ in which the $n$th row is given by $\mathbf{x}_n^{\mathrm{T}}$. From (15.6) the log of the likelihood
function is given by

$$ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \tag{15.13} $$

Figure 15.5 Example of 500 points drawn from the mixture of three Gaussians shown in Figure 3.8. (a) Samples
from the joint distribution $p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})$ in which the three states of $\mathbf{z}$, corresponding to the three components
of the mixture, are depicted in red, green, and blue, and (b) the corresponding samples from the marginal
distribution $p(\mathbf{x})$, which is obtained by simply ignoring the values of $\mathbf{z}$ and just plotting the $\mathbf{x}$ values. The data set
in (a) is said to be complete, whereas that in (b) is incomplete, as discussed further in Section 15.3. (c) The
same samples in which the colours represent the value of the responsibilities $\gamma(z_{nk})$ associated with data point
$\mathbf{x}_n$, obtained by plotting the corresponding point using proportions of red, blue, and green ink given by $\gamma(z_{nk})$ for
$k = 1, 2, 3$, respectively.


Maximizing this log likelihood function (15.13) is a more complex problem than for 
a single Gaussian. The difficulty arises from the presence of the summation over k 
that appears inside the logarithm in (15.13), so that the logarithm function no longer 
acts directly on the Gaussian. If we set the derivatives of the log likelihood to zero, 
we will no longer obtain a closed-form solution, as we will see shortly. 

Before discussing how to maximize this function, it is worth emphasizing that
there is a significant problem associated with the maximum likelihood framework
when applied to Gaussian mixture models, due to the presence of singularities. For
simplicity, consider a Gaussian mixture whose components have covariance matrices
given by $\boldsymbol{\Sigma}_k = \sigma_k^2 \mathbf{I}$, where $\mathbf{I}$ is the unit matrix, although the conclusions will hold
for general covariance matrices. Suppose that one of the components of the mixture
model, let us say the $j$th component, has its mean $\boldsymbol{\mu}_j$ exactly equal to one of the data
points so that $\boldsymbol{\mu}_j = \mathbf{x}_n$ for some value of $n$. This data point will then contribute a
term in the likelihood function of the form

$$ \mathcal{N}(\mathbf{x}_n \mid \mathbf{x}_n, \sigma_j^2 \mathbf{I}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{\sigma_j^{D}}. \tag{15.14} $$

If we consider the limit $\sigma_j \to 0$, then we see that this term goes to infinity and
so the log likelihood function will also go to infinity. Thus, the maximization of
the log likelihood function is not a well-posed problem because such singularities
will always be present and will occur whenever one of the Gaussian components
‘collapses’ onto a specific data point. Recall that this problem did not arise with
a single Gaussian distribution. To understand the difference, note that if a single
Gaussian collapses onto a data point, it will contribute multiplicative factors to the
likelihood function arising from the other data points, and these factors will go to
zero exponentially fast, giving an overall likelihood that goes to zero rather than
infinity. However, once we have (at least) two components in the mixture, one of
the components can have a finite variance and therefore assign finite probability to
all the data points while the other component can shrink onto one specific data point
and thereby contribute an ever increasing additive value to the log likelihood. This
is illustrated in Figure 15.6. These singularities provide an example of the
over-fitting that can occur in a maximum likelihood approach. When applying maximum
likelihood to Gaussian mixture models, we must take steps to avoid finding such
pathological solutions and instead seek local maxima of the likelihood function that
are well behaved. We can try to avoid the singularities by using suitable heuristics,
for instance by detecting when a Gaussian component is collapsing and resetting
its mean to a randomly chosen value while also resetting its covariance to some
large value and then continuing with the optimization. The singularities can also be
avoided by adding a regularization term to the log likelihood corresponding to a prior
distribution over the parameters.

Figure 15.6 Illustration of how singularities in the likelihood function arise with mixtures of Gaussians.
This should be compared with a single Gaussian shown in Figure 2.9 for which no singularities arise.
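The collapse of a component onto a data point can be observed numerically. The sketch below uses a hypothetical one-dimensional data set and a two-component mixture, with the second component centred exactly on the first data point; the function name `gmm_log_likelihood_1d` and all numerical values are assumptions made for illustration.

```python
import numpy as np

def gmm_log_likelihood_1d(x, pi, mu, sigma):
    """Log likelihood (15.13) for a one-dimensional Gaussian mixture."""
    x = x[:, None]  # shape (N, 1) so that it broadcasts against the K components
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.log((pi * dens).sum(axis=1)).sum())

x = np.array([-1.5, -0.3, 0.2, 0.9, 2.0])
pi = np.array([0.5, 0.5])
mu = np.array([0.0, x[0]])  # second component centred exactly on the first data point

widths = (1.0, 1e-2, 1e-4, 1e-6)
values = [gmm_log_likelihood_1d(x, pi, mu, np.array([1.0, s])) for s in widths]
for s, v in zip(widths, values):
    print(s, v)
```

As the width of the second component shrinks, the log likelihood keeps growing, because the first component continues to assign finite probability to the remaining points while the collapsing component contributes an ever larger term at its own data point.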

A further issue in finding maximum likelihood solutions arises because for any 
given maximum likelihood solution, a K-component mixture will have a total of K! 
equivalent solutions corresponding to the K! ways of assigning K sets of parameters 
to K components. In other words, for any given (non-degenerate) point in the space 
of parameter values, there will be a further $K! - 1$ additional points all of which give
rise to exactly the same distribution. This problem is known as identifiability (Casella 
and Berger, 2002) and is an important issue when we wish to interpret the parameter 
values discovered by a model. Identifiability will also arise when we discuss models 
having continuous latent variables. However, when finding a good density model, it 
is irrelevant because any of the equivalent solutions is as good as any other. 


15.2.2 Maximum likelihood 


An elegant and powerful method for finding maximum likelihood solutions for
models with latent variables is called the expectation–maximization algorithm or EM
algorithm (Dempster, Laird, and Rubin, 1977; McLachlan and Krishnan, 1997). In
this chapter we will give three different derivations of the EM algorithm, each more
general than the previous. We begin here with a relatively informal treatment in
the context of a Gaussian mixture model. We emphasize, however, that EM has
broad applicability, and the underlying concepts will be encountered in the context
of several different models in this book.

We begin by writing down the conditions that must be satisfied at a maximum
of the likelihood function. Setting the derivatives of $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ in (15.13) with
respect to the means $\boldsymbol{\mu}_k$ of the Gaussian components to zero, we obtain

$$ 0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}}_{\gamma(z_{nk})} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \tag{15.15} $$

where we have made use of the form (3.26) for the Gaussian distribution. Note
that the posterior probabilities, or responsibilities, $\gamma(z_{nk})$ given by (15.12) appear
naturally on the right-hand side. Multiplying by $\boldsymbol{\Sigma}_k$ (which we assume to be
non-singular) and rearranging we obtain

$$ \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n \tag{15.16} $$

where we have defined

$$ N_k = \sum_{n=1}^{N} \gamma(z_{nk}). \tag{15.17} $$

We can interpret $N_k$ as the effective number of points assigned to cluster $k$. Note
carefully the form of this solution. We see that the mean $\boldsymbol{\mu}_k$ for the $k$th Gaussian
component is obtained by taking a weighted mean of all the points in the data set,
in which the weighting factor for data point $\mathbf{x}_n$ is given by the posterior probability
$\gamma(z_{nk})$ that component $k$ was responsible for generating $\mathbf{x}_n$.

If we set the derivative of $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\Sigma}_k$ to zero and follow
a similar line of reasoning by making use of the result for the maximum likelihood
solution for the covariance matrix of a single Gaussian, we obtain

$$ \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}, \tag{15.18} $$

which has the same form as the corresponding result for a single Gaussian fitted to
the data set, but again with each data point weighted by the corresponding posterior
probability and with the denominator given by the effective number of points
associated with the corresponding component.

Finally, we maximize $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to the mixing coefficients
$\pi_k$. Here we must take account of the constraint (15.8), which requires the mixing
coefficients to sum to one. This can be achieved using a Lagrange multiplier $\lambda$ and
maximizing the following quantity:

$$ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right), \tag{15.19} $$

which gives

$$ 0 = \sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} + \lambda \tag{15.20} $$

where again we see the appearance of the responsibilities. If we now multiply both
sides by $\pi_k$ and sum over $k$ making use of the constraint (15.8), we find $\lambda = -N$.
Using this to eliminate $\lambda$ and rearranging, we obtain

$$ \pi_k = \frac{N_k}{N} \tag{15.21} $$

so that the mixing coefficient for the $k$th component is given by the average
responsibility which that component takes for explaining the data points.

Note that the results (15.16), (15.18), and (15.21) do not constitute a closed-form 
solution for the parameters of the mixture model because the responsibilities $\gamma(z_{nk})$
depend on those parameters in a complex way through (15.12). However, these re- 
sults do suggest a simple iterative scheme for finding a solution to the maximum 
likelihood problem, which as we will see turns out to be an instance of the EM algo- 
rithm for the particular case of the Gaussian mixture model. We first choose some 
initial values for the means, covariances, and mixing coefficients. Then we alternate 
between the following two updates, which we will call the E step and the M step for 
reasons that will become apparent shortly. In the expectation step, or E step, we use 
the current values for the parameters to evaluate the posterior probabilities, or re- 
sponsibilities, given by (15.12). We then use these probabilities in the maximization 
step, or M step, to re-estimate the means, covariances, and mixing coefficients using 
the results (15.16), (15.18), and (15.21). Note that in so doing, we first evaluate the 
new means using (15.16) and then use these new values to find the covariances using 
(15.18), in keeping with the corresponding result for a single Gaussian distribution. 
We will show that each update to the parameters resulting from an E step followed
by an M step is guaranteed not to decrease the log likelihood function. In practice, the
algorithm is deemed to have converged when the change in the log likelihood function,
or alternatively in the parameters, falls below some threshold.
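The iterative scheme above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `em_gmm` and `log_gaussian`, the tolerance `tol`, the iteration cap `max_iter`, the synthetic data and the small ridge `reg` added to each covariance (one simple way to discourage a component from collapsing onto a data point) are all choices made here rather than taken from the text.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(x | mean, cov) for each row of X (N, D)."""
    D = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def em_gmm(X, pi, mu, Sigma, max_iter=200, tol=1e-6, reg=1e-6):
    """EM for a Gaussian mixture. X: (N, D); pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    N, D = X.shape
    K = len(pi)
    pi, mu, Sigma = np.array(pi, float), np.array(mu, float), np.array(Sigma, float)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: responsibilities (15.12), evaluated in log space for stability.
        log_joint = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(K)], axis=1
        )
        log_norm = np.logaddexp.reduce(log_joint, axis=1)  # ln sum_k pi_k N(x_n | mu_k, Sigma_k)
        gamma = np.exp(log_joint - log_norm[:, None])
        # Log likelihood (15.13) of the parameters used in this E step.
        ll = float(log_norm.sum())
        if tol > ll - prev_ll:
            break  # improvement below the threshold
        prev_ll = ll
        # M step: (15.17), (15.16), (15.18) and (15.21); the new means enter the covariances.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += reg * np.eye(D)  # assumption: small ridge against collapse
        pi = Nk / N
    return pi, mu, Sigma, ll

# Synthetic data (hypothetical): 150 points near (-2, -2) and 50 points near (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(150, 2)), rng.normal(2.0, 0.5, size=(50, 2))])
pi, mu, Sigma, ll = em_gmm(
    X, pi=[0.5, 0.5], mu=[[-1.0, 0.0], [1.0, 0.0]], Sigma=[np.eye(2), np.eye(2)]
)
print(pi)
print(mu)
```

On this well-separated data the fitted mixing coefficients should come out close to the sample fractions 0.75 and 0.25. The returned `ll` is the log likelihood evaluated in the last E step that was carried out.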

We illustrate the EM algorithm for a mixture of two Gaussians applied to the
re-scaled Old Faithful data in Figure 15.7. The centres are initialized using the same
values as for the K-means algorithm in Figure 15.1, and the covariance matrices are
initialized to be proportional to the unit matrix. Plot (a) shows the data points in
green, together with the initial configuration of the mixture model in which the one
standard-deviation contours for the two Gaussian components are shown as blue and
red circles. Plot (b) shows the result of the initial E step, in which each data point is
depicted using a proportion of blue ink equal to the posterior probability of having
been generated from the blue component and a corresponding proportion of red ink
given by the posterior probability of having been generated by the red component.
Thus, points that have a roughly equal probability for belonging to either cluster
appear purple. The situation after the first M step is shown in plot (c), in which the
mean of the blue Gaussian has moved to the mean of the data set, weighted by the
probabilities of each data point belonging to the blue cluster. In other words it has
moved to the centre of mass of the blue ink. Similarly, the covariance of the blue
Gaussian is set equal to the covariance of the blue ink. Analogous results hold for the
red component. Plots (d), (e), and (f) show the results after 2, 5, and 20 complete
cycles of EM, respectively. In plot (f) the algorithm is close to convergence.

Figure 15.7 Application of the EM algorithm to the Old Faithful data set as used for the illustration of the
K-means algorithm in Figure 15.1. See the text for details.

Note that the EM algorithm takes many more iterations to reach (approximate)
convergence compared with the K-means algorithm and that each cycle requires
significantly more computation. It is therefore common to run the K-means algorithm
to find a suitable initialization for a Gaussian mixture model that is subsequently
adapted using EM. The covariance matrices can conveniently be initialized to the
sample covariances of the clusters found by the K-means algorithm, and the mixing
coefficients can be set to the fractions of data points assigned to the respective
clusters. Techniques such as parameter regularization must be employed to avoid
singularities of the likelihood function in which a Gaussian component collapses
onto a particular data point. It should be emphasized that there will generally be
multiple local maxima of the log likelihood function and that EM is not guaranteed
to find the largest of these maxima. Because the EM algorithm for Gaussian mixtures
plays such an important role, we summarize it in Algorithm 15.2.

Figure 15.8 A Gaussian mixture model fitted to the ‘two-moons’ data set, showing that a large number of
mixture components may be required to give an accurate representation of a complex data distribution. Here
the ellipses represent the contours of constant density for the corresponding mixture components. As we move
to spaces of larger dimensionality, the number of components required to model a distribution accurately can
become unacceptably large.

Mixture models are very flexible and can approximate complicated distributions 
to high accuracy given a sufficient number of components if the model parameters 
are chosen appropriately. In practice, however, the number of components can be ex- 
tremely large, especially in spaces of high dimensionality. This problem is illustrated 
for the two-moons data set in Figure 15.8. Nevertheless, mixture models are useful 
in many applications. Also, an understanding of mixture models lays the foundations 
for models with continuous latent variables and for generative models based on deep 
neural networks, which have much better scaling to spaces of high dimensionality. 


15.3 Expectation–Maximization Algorithm


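Algorithm 15.2: EM algorithm for a Gaussian mixture model

Input: Initial model parameters $\{\boldsymbol{\mu}_k\}$, $\{\boldsymbol{\Sigma}_k\}$, $\{\pi_k\}$
       Data set $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
Output: Final model parameters $\{\boldsymbol{\mu}_k\}$, $\{\boldsymbol{\Sigma}_k\}$, $\{\pi_k\}$

repeat
    // E step
    for $n \in \{1, \ldots, N\}$ do
        for $k \in \{1, \ldots, K\}$ do
            $\gamma(z_{nk}) \leftarrow \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \,/\, \sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$
        end for
    end for
    // M step
    for $k \in \{1, \ldots, K\}$ do
        $N_k \leftarrow \sum_{n=1}^{N} \gamma(z_{nk})$
        $\boldsymbol{\mu}_k \leftarrow \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n$
        $\boldsymbol{\Sigma}_k \leftarrow \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}$
        $\pi_k \leftarrow N_k / N$
    end for
    // Log likelihood
    $\mathcal{L} \leftarrow \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
until convergence
return $\{\boldsymbol{\mu}_k\}$, $\{\boldsymbol{\Sigma}_k\}$, $\{\pi_k\}$

Figure 15.9 Graphical representation of a Gaussian mixture model for a set of $N$ i.i.d. data points $\{\mathbf{x}_n\}$,
with corresponding latent points $\{\mathbf{z}_n\}$, where $n = 1, \ldots, N$.

We turn now to a more general view of the EM algorithm in which we focus on the
role of latent variables. As before we denote the set of all observed data points by $\mathbf{X}$,
in which the $n$th row represents $\mathbf{x}_n^{\mathrm{T}}$. Similarly, the corresponding latent variables will
be denoted by an $N \times K$ matrix $\mathbf{Z}$ with rows $\mathbf{z}_n^{\mathrm{T}}$. If we assume that the data points are
drawn independently from the distribution, then we can express the Gaussian mixture
model for this i.i.d. data set using the graphical representation shown in Figure 15.9.
The set of all model parameters is denoted by $\boldsymbol{\theta}$, and so the log likelihood function
is given by

$$ \ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right\}. \tag{15.22} $$

Note that our discussion will apply equally well to continuous latent variables simply
by replacing the sum over $\mathbf{Z}$ with an integral.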

A key observation is that the summation over the latent variables appears inside
the logarithm. Even if the joint distribution $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$ belongs to the exponential
family, the marginal distribution $p(\mathbf{X}|\boldsymbol{\theta})$ typically does not as a result of this
summation. The presence of the sum prevents the logarithm from acting directly on the
joint distribution, resulting in complicated expressions for the maximum likelihood
solution.

Now suppose that, for each observation in $\mathbf{X}$, we were told the corresponding
value of the latent variable $\mathbf{Z}$. We will call $\{\mathbf{X}, \mathbf{Z}\}$ the complete data set, and we will
refer to the actual observed data $\mathbf{X}$ as incomplete, as illustrated in Figure 15.5. The
log likelihood function for the complete data set simply takes the form
$\ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$, and we will suppose that maximization of this complete-data log
likelihood function is straightforward.

In practice, however, we are not given the complete data set $\{\mathbf{X}, \mathbf{Z}\}$ but only
the incomplete data $\mathbf{X}$. Our state of knowledge of the values of the latent variables
in $\mathbf{Z}$ is given only by the posterior distribution $p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})$. Because we cannot use
the complete-data log likelihood, we consider instead its expected value under the
posterior distribution of the latent variables, which corresponds (as we will see) to the
E step of the EM algorithm. In the subsequent M step, we maximize this expectation.
If the current estimate for the parameters is denoted by $\boldsymbol{\theta}^{\text{old}}$, then a pair of successive
E and M steps gives rise to a revised estimate $\boldsymbol{\theta}^{\text{new}}$. The algorithm is initialized
by choosing some starting value for the parameters $\boldsymbol{\theta}^{(0)}$. Although this use of the
expectation may seem somewhat arbitrary, we will see the motivation for this choice
when we give a deeper treatment of EM in Section 15.4.

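In the E step, we use the current parameter values $\boldsymbol{\theta}^{\text{old}}$ to find the posterior
distribution of the latent variables given by $p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. We then use this posterior
distribution to find the expectation of the complete-data log likelihood evaluated for
some general parameter value $\boldsymbol{\theta}$. This expectation, denoted by $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, is given
by

$$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta}). \tag{15.23}$$


Algorithm 15.3: General EM algorithm

Input: Joint distribution $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$
       Initial parameters $\boldsymbol{\theta}^{\text{old}}$
       Data set $\mathbf{x}_1, \ldots, \mathbf{x}_N$
Output: Final parameters $\boldsymbol{\theta}$

repeat
    $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) \leftarrow \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$  // E step
    $\boldsymbol{\theta}^{\text{new}} \leftarrow \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$  // M step
    $\mathcal{L} \leftarrow \ln p(\mathbf{X}|\boldsymbol{\theta}^{\text{new}})$  // Evaluate log likelihood
    $\boldsymbol{\theta}^{\text{old}} \leftarrow \boldsymbol{\theta}^{\text{new}}$  // Update the parameters
until convergence
return $\boldsymbol{\theta}^{\text{new}}$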


In the M step, we determine the revised parameter estimate $\boldsymbol{\theta}^{\text{new}}$ by maximizing this
function:

$$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}). \tag{15.24}$$

Note that in the definition of $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, the logarithm acts directly on the joint
distribution $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$, and so the corresponding M-step maximization will, according
to our assumption, be tractable. The general EM algorithm is summarized in
Algorithm 15.3. It has the property, as we will show later, that each cycle of EM will
increase the incomplete-data log likelihood (unless it is already at a local maximum).
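
The loop in Algorithm 15.3 can be sketched as a small generic driver. This is only an illustrative sketch, not the book's implementation: the names `run_em`, `e_step`, `m_step`, `log_lik`, the tolerance, and the toy model (two unit-variance one-dimensional Gaussians with equal, fixed mixing coefficients) are all assumptions introduced here.

```python
import numpy as np

def run_em(theta, e_step, m_step, log_lik, tol=1e-8, max_iter=500):
    """Generic EM driver (hypothetical helper): alternate E and M steps until
    the incomplete-data log likelihood stops improving."""
    history = [log_lik(theta)]
    for _ in range(max_iter):
        posterior = e_step(theta)      # p(Z | X, theta_old)
        theta = m_step(posterior)      # arg max over theta of Q(theta, theta_old)
        history.append(log_lik(theta))
        if history[-1] - history[-2] < tol:
            break
    return theta, history

# Toy model (an assumption for illustration): two unit-variance 1-D Gaussians
# with equal mixing coefficients, so only the two means are learned.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def component_densities(mu):
    # (N, 2) array holding 0.5 * N(x_n | mu_k, 1)
    return 0.5 * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)

def e_step(mu):
    p = component_densities(mu)
    return p / p.sum(axis=1, keepdims=True)      # responsibilities

def m_step(gamma):
    return (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)

def log_lik(mu):
    return np.log(component_densities(mu).sum(axis=1)).sum()

mu_hat, history = run_em(np.array([-1.0, 1.0]), e_step, m_step, log_lik)
```

The recorded `history` makes the monotonic-increase property easy to inspect: each entry should be no smaller than the previous one, up to floating-point error.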

The EM algorithm can also be used to find MAP (maximum a posteriori) solutions
for models in which a prior $p(\boldsymbol{\theta})$ is defined over the parameters. In this case the E
step remains the same as in the maximum likelihood case, whereas in the M step the
quantity to be maximized is given by $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) + \ln p(\boldsymbol{\theta})$. Suitable choices for the
prior will remove the singularities of the kind illustrated in Figure 15.6.

Here we have considered the use of the EM algorithm to maximize a likelihood 
function when there are discrete latent variables. However, it can also be applied 
when the unobserved variables correspond to missing values in the data set. The 
distribution of the observed values is obtained by taking the joint distribution of all 
the variables and then marginalizing over the missing ones. EM can then be used to 
maximize the corresponding likelihood function. This will be a valid procedure if 
the data values are missing at random, meaning that the mechanism causing values 
to be missing does not depend on the unobserved values. In many situations this will 
not be the case, for instance if a sensor fails to return a value whenever the quantity 
it is measuring exceeds some threshold.

Figure 15.10 This shows the same graph as in Figure 15.9 except that we now suppose
that the discrete variables $\mathbf{z}_n$ are observed, as well as the data variables $\mathbf{x}_n$.


15.3.1 Gaussian mixtures 


We now consider the application of this latent-variable view of EM to the spe- 
cific case of a Gaussian mixture model. Recall that our goal is to maximize the log 
likelihood function (15.13), which is computed using the observed data set X, and 
we saw that this was more difficult than with a single Gaussian distribution due to 
the summation over k that occurs inside the logarithm. Suppose then that in addi- 
tion to the observed data set X, we were also given the values of the corresponding 
discrete variables Z. Recall that Figure 15.5(a) shows a complete data set (i.e., one 
that includes labels showing which component generated each data point) whereas 
Figure 15.5(b) shows the corresponding incomplete data set. A graphical model for 
the complete data is shown in Figure 15.10. 

Now consider the problem of maximizing the likelihood for the complete data 
set {X, Z}. From (15.9) and (15.10), this likelihood function takes the form 


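$$p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_{nk}} \tag{15.25}$$

where $z_{nk}$ denotes the $k$th component of $\mathbf{z}_n$. Taking the logarithm, we obtain

$$\ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \tag{15.26}$$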


Comparison with the log likelihood function (15.13) for the incomplete data shows 
that the summation over k and the logarithm have been interchanged. The loga- 
rithm now acts directly on the Gaussian distribution, which itself is a member of 
the exponential family. Not surprisingly, this leads to a much simpler solution to the 
maximum likelihood problem, as we now show. Consider first the maximization with 
respect to the means and covariances. Because $\mathbf{z}_n$ is a $K$-dimensional vector with all
elements equal to 0 except for a single element having the value 1, the complete-data
log likelihood function is simply a sum of $K$ independent contributions, one for each
mixture component. Thus, the maximization with respect to a mean or a covariance
is exactly as for a single Gaussian, except that it involves only the subset of data
points that are ‘assigned’ to that component. For the maximization with respect to
the mixing coefficients, note that these are coupled for different values of $k$ by virtue
of the summation constraint (15.8). Again, this can be enforced using a Lagrange
multiplier as before, which leads to the result

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} z_{nk} \tag{15.27}$$


so that the mixing coefficients are equal to the fractions of data points assigned to 
the corresponding components. 
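
As a minimal sketch of the complete-data case, assuming the labels are supplied as a one-hot matrix, the maximum likelihood estimates reduce to per-component averages. The function name `complete_data_mle` and the small data set are hypothetical, chosen only for illustration.

```python
import numpy as np

def complete_data_mle(X, Z):
    """ML estimates for a Gaussian mixture when the one-hot labels Z are observed.

    X: (N, D) data matrix, Z: (N, K) one-hot assignment matrix.
    Assumes every component has at least one assigned point.
    """
    N, D = X.shape
    K = Z.shape[1]
    N_k = Z.sum(axis=0)                      # number of points per component
    pi = N_k / N                             # mixing coefficients, as in (15.27)
    mu = (Z.T @ X) / N_k[:, None]            # per-component sample means
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X[Z[:, k] == 1] - mu[k]       # only points assigned to component k
        Sigma[k] = diff.T @ diff / N_k[k]    # per-component ML covariance
    return pi, mu, Sigma

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0], [10.0, 14.0]])
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
pi, mu, Sigma = complete_data_mle(X, Z)
```

Each component is fitted exactly as a single Gaussian would be, using only its own subset of the data, and the mixing coefficients are the fractions of points per component.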

Thus, we see that the complete-data log likelihood function can be maximized 
trivially in closed form. In practice, however, we do not have values for the latent 
variables. Therefore, as discussed earlier, we consider the expectation, with respect 
to the posterior distribution of the latent variables, of the complete-data log like- 
lihood. Using (15.9) and (15.10) together with Bayes’ theorem, we see that this 
posterior distribution takes the form 


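$$p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}. \tag{15.28}$$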


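We see that this factorizes over $n$ so that under the posterior distribution, the $\{\mathbf{z}_n\}$ are
independent. This is easily verified by inspecting the directed graph in Figure 15.9
and making use of the d-separation criterion. The expected value of the indicator
variable $z_{nk}$ under this posterior distribution is then given by

$$\mathbb{E}[z_{nk}] = \frac{\sum_{\mathbf{z}_n} z_{nk} \prod_{k'} \left[ \pi_{k'} \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_{k'}, \boldsymbol{\Sigma}_{k'}) \right]^{z_{nk'}}}{\sum_{\mathbf{z}_n} \prod_{j} \left[ \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right]^{z_{nj}}} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} = \gamma(z_{nk}), \tag{15.29}$$

which is just the responsibility of component $k$ for data point $\mathbf{x}_n$. The expected value
of the complete-data log likelihood function is therefore given by

$$\mathbb{E}_{\mathbf{Z}}\left[ \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \tag{15.30}$$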

We can now proceed as follows. First we choose some initial values for the parameters
$\boldsymbol{\mu}^{\text{old}}$, $\boldsymbol{\Sigma}^{\text{old}}$, and $\boldsymbol{\pi}^{\text{old}}$, and we use these to evaluate the responsibilities (the E
step). We then keep the responsibilities fixed and maximize (15.30) with respect to
$\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ (the M step). This leads to closed-form solutions for $\boldsymbol{\mu}^{\text{new}}$, $\boldsymbol{\Sigma}^{\text{new}}$,
and $\boldsymbol{\pi}^{\text{new}}$ given by (15.16), (15.18), and (15.21) as before. This is precisely the
EM algorithm for Gaussian mixtures as derived earlier. We will gain more insight
into the role of the expected complete-data log likelihood function when we discuss the
convergence of the EM algorithm in Section 15.4.
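
The E and M steps for a Gaussian mixture can be sketched in NumPy as follows. This is a hedged illustration rather than a reference implementation: the helper names `log_gaussian` and `em_gmm`, the synthetic two-cluster data, the initial values, and the stopping tolerance are assumptions made here. The sketch evaluates responsibilities in log space for numerical stability and does not guard against the covariance singularities discussed elsewhere in the chapter.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log of N(x_n | mu, Sigma) for every row x_n of X."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    y = np.linalg.solve(L, (X - mu).T)             # whitened residuals, shape (D, N)
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (D * np.log(2 * np.pi) + log_det + (y ** 2).sum(axis=0))

def em_gmm(X, pi, mu, Sigma, n_iter=100, tol=1e-8):
    N, _ = X.shape
    K = len(pi)
    log_liks = []
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) from the current parameters
        log_p = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(K)], axis=1
        )
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        gamma = np.exp(log_p - log_norm[:, None])
        log_liks.append(log_norm.sum())            # log likelihood at current parameters
        # M step: closed-form updates with the responsibilities held fixed
        N_k = gamma.sum(axis=0)
        mu = (gamma.T @ X) / N_k[:, None]
        Sigma = np.stack([
            ((gamma[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / N_k[k]
            for k in range(K)
        ])
        pi = N_k / N
        if len(log_liks) > 1 and log_liks[-1] - log_liks[-2] < tol:
            break
    return pi, mu, Sigma, gamma, log_liks

# Synthetic data (illustrative assumption): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(150, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(150, 2))])
pi0 = np.array([0.5, 0.5])
mu0 = np.array([[1.0, 1.0], [4.0, 4.0]])
Sigma0 = np.stack([np.eye(2), np.eye(2)])
pi_hat, mu_hat, Sigma_hat, gamma, log_liks = em_gmm(X, pi0, mu0, Sigma0)
```

Note that `log_liks` records the log likelihood at the parameters used in each E step, so the sequence should be non-decreasing.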


15. DISCRETE LATENT VARIABLES 


Figure 15.11 The probabilistic graphical model for sequential data corresponding to a
hidden Markov model. The discrete latent variables are no longer independent but
form a Markov chain.


Throughout this chapter we assume that the data observations are i.i.d. For or- 
dered observations that form a sequence, the mixture model can be extended by con- 
necting the latent variables in a Markov chain to give a hidden Markov model whose 
graphical structure is shown in Figure 15.11. The EM algorithm can be extended 
to this more complex model in which the E step involves a sequential calculation in 
which messages are passed along the chain of latent variables (Bishop, 2006). 


15.3.2 Relation to K-means 


Comparison of the K-means algorithm with the EM algorithm for Gaussian 
mixtures shows that there is a close similarity. Whereas the K-means algorithm 
performs a hard assignment of data points to clusters in which each data point is 
associated uniquely with one cluster, the EM algorithm makes a soft assignment 
based on the posterior probabilities. In fact, we can derive the K-means algorithm 
as a particular limit of EM for Gaussian mixtures as follows. 

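Consider a Gaussian mixture model in which the covariance matrices of the
mixture components are given by $\epsilon \mathbf{I}$, where $\epsilon$ is a variance parameter that is shared
by all the components, and $\mathbf{I}$ is the identity matrix, so that

$$p(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi\epsilon)^{D/2}} \exp\left\{ -\frac{1}{2\epsilon} \|\mathbf{x} - \boldsymbol{\mu}_k\|^2 \right\}. \tag{15.31}$$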


We now consider the EM algorithm for a mixture of $K$ Gaussians of this form in
which we treat $\epsilon$ as a fixed constant, instead of a parameter to be re-estimated. From
(15.12) the posterior probabilities, or responsibilities, for a particular data point $\mathbf{x}_n$
are given by

$$\gamma(z_{nk}) = \frac{\pi_k \exp\left\{ -\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 / 2\epsilon \right\}}{\sum_j \pi_j \exp\left\{ -\|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 / 2\epsilon \right\}}. \tag{15.32}$$

Consider the limit $\epsilon \to 0$. The denominator consists of a sum of terms indexed by $j$
each of which goes to zero. The particular term for which $\|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2$ is smallest,
say $j = l$, will go to zero most slowly and will then dominate this sum. Therefore,
the responsibilities $\gamma(z_{nk})$ for the data point $\mathbf{x}_n$ all go to zero except for term $l$, for
which the responsibility $\gamma(z_{nl})$ will go to unity. Note that this holds independently
of the values of the $\pi_k$ so long as none of the $\pi_k$ is zero. Thus, in this limit, we obtain
a hard assignment of data points to clusters, just as in the K-means algorithm, so that
$\gamma(z_{nk}) \to r_{nk}$ where $r_{nk}$ is defined by (15.2). Each data point is thereby assigned to
the cluster having the closest mean. The EM re-estimation equation for the $\boldsymbol{\mu}_k$, given
by (15.16), then reduces to the K-means result (15.4). Note that the re-estimation
formula for the mixing coefficients (15.21) simply resets the value of $\pi_k$ to be equal
to the fraction of data points assigned to cluster $k$, although these parameters no
longer play an active role in the algorithm.

Finally, in the limit $\epsilon \to 0$, the expected complete-data log likelihood, given by
(15.30), becomes (see Exercise 15.12)


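$$\mathbb{E}_{\mathbf{Z}}\left[ \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 + \text{const}. \tag{15.33}$$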


Thus, we see that in this limit, maximizing the expected complete-data log likelihood 
is equivalent to minimizing the error measure J for the K-means algorithm given by 
(15.1). Note that the K-means algorithm does not estimate the covariances of the 
clusters but only the cluster means. 
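
The hardening of the responsibilities as the shared variance shrinks can be checked numerically. The following is a small sketch under assumed values: the function name `responsibilities`, the three data points, the two means, and the deliberately unequal mixing coefficients are all illustrative choices, not taken from the text.

```python
import numpy as np

def responsibilities(X, mu, pi, eps):
    """Responsibilities of the form (15.32) for shared covariance eps * I."""
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    logits = np.log(pi) - sq_dist / (2.0 * eps)
    logits -= logits.max(axis=1, keepdims=True)                     # avoid underflow
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
mu = np.array([[0.0, 0.0], [3.0, 0.0]])
pi = np.array([0.9, 0.1])          # unequal on purpose; must be non-zero

soft = responsibilities(X, mu, pi, eps=10.0)    # large variance: soft assignments
hard = responsibilities(X, mu, pi, eps=1e-4)    # small variance: nearly one-hot

# K-means style hard assignment: index of the closest mean for each point
nearest = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
```

For small `eps` the rows of `hard` match the one-hot encoding of `nearest`, regardless of the non-zero mixing coefficients, whereas for large `eps` the mixing coefficients still influence the soft assignments.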


15.3.3 Mixtures of Bernoulli distributions 


So far in this chapter, we have focused on distributions over continuous variables 
described by mixtures of Gaussians. As a further example of mixture modelling and 
to illustrate the EM algorithm in a different context, we now discuss mixtures of 
discrete binary variables described by Bernoulli distributions. This model is also 
known as latent class analysis (Lazarsfeld and Henry, 1968; McLachlan and Peel, 
2000). 

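Consider a set of $D$ binary variables $x_i$, where $i = 1, \ldots, D$, each of which is
governed by a Bernoulli distribution (Section 3.1.1) with parameter $\mu_i$, so that

$$p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{(1 - x_i)} \tag{15.34}$$

where $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$ and $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_D)^{\mathrm{T}}$. We see that the individual
variables $x_i$ are independent, given $\boldsymbol{\mu}$. The mean and covariance of this distribution
are easily seen to be (Exercise 15.13)

$$\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu} \tag{15.35}$$
$$\operatorname{cov}[\mathbf{x}] = \operatorname{diag}\{\mu_i (1 - \mu_i)\}. \tag{15.36}$$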


Now let us consider a finite mixture of these distributions given by 


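$$p(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\pi}) = \sum_{k=1}^{K} \pi_k p(\mathbf{x}|\boldsymbol{\mu}_k) \tag{15.37}$$

where $\boldsymbol{\mu} = \{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K\}$, $\boldsymbol{\pi} = \{\pi_1, \ldots, \pi_K\}$, and

$$p(\mathbf{x}|\boldsymbol{\mu}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{(1 - x_i)}. \tag{15.38}$$

The mixing coefficients satisfy (15.7) and (15.8). The mean and covariance of this
mixture distribution are given by (Exercise 15.14)

$$\mathbb{E}[\mathbf{x}] = \sum_{k=1}^{K} \pi_k \boldsymbol{\mu}_k \tag{15.39}$$
$$\operatorname{cov}[\mathbf{x}] = \sum_{k=1}^{K} \pi_k \left\{ \boldsymbol{\Sigma}_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{\mathrm{T}} \right\} - \mathbb{E}[\mathbf{x}] \mathbb{E}[\mathbf{x}]^{\mathrm{T}} \tag{15.40}$$

where $\boldsymbol{\Sigma}_k = \operatorname{diag}\{\mu_{ki} (1 - \mu_{ki})\}$. Because the covariance matrix $\operatorname{cov}[\mathbf{x}]$ is no
longer diagonal, the mixture distribution can capture correlations between the
variables, unlike a single Bernoulli distribution.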
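
A short numerical sketch of (15.39) and (15.40) shows the off-diagonal covariance appearing. The function name `bernoulli_mixture_moments` and the two-component, two-variable parameter values are assumptions chosen for illustration.

```python
import numpy as np

def bernoulli_mixture_moments(pi, mu):
    """Mean and covariance of a Bernoulli mixture.

    pi: (K,) mixing coefficients, mu: (K, D) component parameters.
    """
    mean = pi @ mu                                              # (15.39)
    second_moment = sum(
        p * (np.diag(m * (1.0 - m)) + np.outer(m, m)) for p, m in zip(pi, mu)
    )
    cov = second_moment - np.outer(mean, mean)                  # (15.40)
    return mean, cov

pi = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.9],     # component 1: both variables usually on
               [0.1, 0.1]])    # component 2: both variables usually off
mean, cov = bernoulli_mixture_moments(pi, mu)
```

Here each variable has marginal mean 0.5 and variance 0.25, exactly as a single Bernoulli with parameter 0.5 would, but the mixture additionally has a covariance of 0.16 between the two variables.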

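If we are given a data set $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ then the log likelihood function
for this model is given by

$$\ln p(\mathbf{X}|\boldsymbol{\mu}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k p(\mathbf{x}_n|\boldsymbol{\mu}_k) \right\}. \tag{15.41}$$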


Again we see the appearance of the summation inside the logarithm, so that the 
maximum likelihood solution no longer has closed form. 

We now derive the EM algorithm for maximizing the likelihood function for the 
mixture of Bernoulli distributions. To do this, we first introduce an explicit discrete 
latent variable $\mathbf{z}$ associated with each instance of $\mathbf{x}$. As with the Gaussian mixture,
$\mathbf{z}$ has a 1-of-$K$ coding so that $\mathbf{z} = (z_1, \ldots, z_K)^{\mathrm{T}}$ is a binary $K$-dimensional vector
having a single component equal to 1, with all other components equal to 0. We can
then write the conditional distribution of $\mathbf{x}$, given the latent variable, as

$$p(\mathbf{x}|\mathbf{z}, \boldsymbol{\mu}) = \prod_{k=1}^{K} p(\mathbf{x}|\boldsymbol{\mu}_k)^{z_k} \tag{15.42}$$

whereas the prior distribution for the latent variables is the same as for the
mixture-of-Gaussians model, so that

$$p(\mathbf{z}|\boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{z_k}. \tag{15.43}$$

If we form the product of $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\mu})$ and $p(\mathbf{z}|\boldsymbol{\pi})$ and then marginalize over $\mathbf{z}$, then we
recover (15.37).

To derive the EM algorithm, we first write down the complete-data log likelihood 
function, which is given by 


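$$\ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\mu}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \right] \right\} \tag{15.44}$$

where $\mathbf{X} = \{\mathbf{x}_n\}$ and $\mathbf{Z} = \{\mathbf{z}_n\}$. Next we take the expectation of the complete-data
log likelihood with respect to the posterior distribution of the latent variables to give

$$\mathbb{E}_{\mathbf{Z}}\left[ \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\mu}, \boldsymbol{\pi}) \right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \right] \right\} \tag{15.45}$$

where $\gamma(z_{nk}) = \mathbb{E}[z_{nk}]$ is the posterior probability, or responsibility, of component
$k$ given data point $\mathbf{x}_n$. In the E step, these responsibilities are evaluated using Bayes’
theorem, which takes the form

$$\gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \frac{\sum_{\mathbf{z}_n} z_{nk} \prod_{k'} \left[ \pi_{k'} p(\mathbf{x}_n|\boldsymbol{\mu}_{k'}) \right]^{z_{nk'}}}{\sum_{\mathbf{z}_n} \prod_{j} \left[ \pi_j p(\mathbf{x}_n|\boldsymbol{\mu}_j) \right]^{z_{nj}}} = \frac{\pi_k p(\mathbf{x}_n|\boldsymbol{\mu}_k)}{\sum_{j=1}^{K} \pi_j p(\mathbf{x}_n|\boldsymbol{\mu}_j)}. \tag{15.46}$$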


If we consider the sum over n in (15.45), we see that the responsibilities enter 
only through two terms, which can be written as 


$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \tag{15.47}$$
$$\bar{\mathbf{x}}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \mathbf{x}_n \tag{15.48}$$

where $N_k$ is the effective number of data points associated with component $k$. In the
M step, we maximize the expected complete-data log likelihood with respect to the
parameters $\boldsymbol{\mu}_k$ and $\boldsymbol{\pi}$. If we set the derivative of (15.45) with respect to $\boldsymbol{\mu}_k$ equal to
zero and rearrange the terms, we obtain

$$\boldsymbol{\mu}_k = \bar{\mathbf{x}}_k. \tag{15.49}$$

We see that this sets the mean of component $k$ equal to a weighted mean of the
data, with weighting coefficients given by the responsibilities that component $k$ takes
for each of the data points. For the maximization with respect to $\pi_k$, we need to
introduce a Lagrange multiplier to enforce the constraint $\sum_k \pi_k = 1$. Following
analogous steps to those used for the mixture of Gaussians, we then obtain

$$\pi_k = \frac{N_k}{N}, \tag{15.50}$$

which represents the intuitively reasonable result that the mixing coefficient for
component $k$ is given by the effective fraction of points in the data set explained by that
component.

Figure 15.12 Illustration of the Bernoulli mixture model in which the top row shows
examples from the digits data set after converting the pixel values from grey scale to
binary using a threshold of 0.5. On the bottom row the first three images show the
parameters $\mu_{ki}$ for each of the three components in the mixture model. As a
comparison, we also fit the same data set using a single multivariate Bernoulli
distribution, again using maximum likelihood. This amounts to simply averaging the
counts in each pixel and is shown by the right-most image on the bottom row.

Note that in contrast to the mixture of Gaussians, there are no singularities in 
which the likelihood function goes to infinity. This can be seen by noting that the 
likelihood function is bounded above because $0 \leq p(\mathbf{x}_n|\boldsymbol{\mu}_k) \leq 1$. There exist
solutions for which the likelihood function is zero, but these will not be found by EM
provided it is not initialized to a pathological starting point, because the EM algo- 
rithm always increases the value of the likelihood function, until a local maximum 
is found. 
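
The E step (15.46) and the M-step updates (15.47) to (15.50) can be sketched together as follows. This is an illustrative sketch only: the function name `em_bernoulli_mixture`, the small constant `eps` used to guard the logarithms, the synthetic six-pixel data with two noisy prototypes, and the random initialization range are assumptions made here, and the initial parameters are not normalized.

```python
import numpy as np

def em_bernoulli_mixture(X, pi, mu, n_iter=100, eps=1e-12):
    """EM for a mixture of Bernoulli distributions; X is an (N, D) array of 0s and 1s."""
    N = X.shape[0]
    log_liks = []
    for _ in range(n_iter):
        # E step (15.46), evaluated in log space; eps guards against log(0)
        log_p = (np.log(pi)
                 + X @ np.log(mu + eps).T
                 + (1.0 - X) @ np.log(1.0 - mu + eps).T)        # (N, K)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        gamma = np.exp(log_p - log_norm[:, None])
        log_liks.append(log_norm.sum())
        # M step
        N_k = gamma.sum(axis=0)                 # (15.47)
        mu = (gamma.T @ X) / N_k[:, None]       # (15.48) and (15.49)
        pi = N_k / N                            # (15.50)
    return pi, mu, gamma, log_liks

# Synthetic binary data (illustrative assumption): two prototypes, 10% pixel flips.
rng = np.random.default_rng(0)
prototypes = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]])
labels = rng.integers(0, 2, size=200)
flips = rng.random((200, 6)) < 0.1
X = np.where(flips, 1 - prototypes[labels], prototypes[labels]).astype(float)

pi0 = np.full(2, 0.5)
mu0 = rng.uniform(0.25, 0.75, size=(2, 6))
pi_hat, mu_hat, gamma, log_liks = em_bernoulli_mixture(X, pi0, mu0)
```

Rounding the rows of `mu_hat` should recover the two prototypes in some order, since EM does not fix the labelling of the components.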

We illustrate the Bernoulli mixture model in Figure 15.12 by using it to model 
handwritten digits. Here the digit images have been turned into binary vectors by 
setting all elements whose values exceed 0.5 to 1 and setting the remaining elements 
to 0. We now fit a data set of N = 600 such digits, comprising the digits ‘2’, ‘3’, 
and ‘4’, with a mixture of K = 3 Bernoulli distributions by running 10 iterations of 
the EM algorithm. The mixing coefficients were initialized to $\pi_k = 1/K$, and the
parameters $\mu_{kj}$ were set to random values chosen uniformly in the range $(0.25, 0.75)$
and then normalized to satisfy the constraint that $\sum_j \mu_{kj} = 1$. We see that a
mixture of three Bernoulli distributions is able to find the three clusters in the data set
corresponding to the different digits. It is straightforward to extend the analysis of 
Bernoulli mixtures to the case of multinomial binary variables having M > 2 states 
by making use of the discrete distribution (3.14).


15.4. Evidence Lower Bound


Appendix B


Exercise 15.21


We now present an even more general perspective on the EM algorithm by deriving 
a lower bound on the log likelihood function, which is known as the evidence lower 
bound or ELBO. It is sometimes called a variational lower bound. Here the term 
evidence refers to the (log) likelihood function, which is sometimes called the ‘model 
evidence’ in a Bayesian setting as it allows different models to be compared without 
the use of hold-out data (Bishop, 2006). As an illustration of this bound, we use it 
to re-derive the EM algorithm for Gaussian mixtures from a third perspective. The 
ELBO will play an important role in several of the deep generative models discussed 
in later chapters. It also provides an example of a variational framework in which we 
introduce a distribution q(Z) over the latent variables and then optimize with respect 
to this distribution using the calculus of variations. 

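Consider a probabilistic model in which we collectively denote all the observed
variables by $\mathbf{X}$ and all the hidden variables by $\mathbf{Z}$. The joint distribution $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$ is
governed by a set of parameters denoted by $\boldsymbol{\theta}$. Our goal is to maximize the likelihood
function:

$$p(\mathbf{X}|\boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta}). \tag{15.51}$$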


Here we are assuming that Z is discrete, although the discussion is identical if Z 
comprises continuous variables or a combination of discrete and continuous vari- 
ables, with summation replaced by integration as appropriate. 

We will suppose that direct optimization of $p(\mathbf{X}|\boldsymbol{\theta})$ is difficult, but that
optimization of the complete-data likelihood function $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$ is significantly easier. Next
we introduce a distribution $q(\mathbf{Z})$ defined over the latent variables, and we observe
that, for any choice of $q(\mathbf{Z})$, the following decomposition holds:

$$\ln p(\mathbf{X}|\boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \| p) \tag{15.52}$$

where we have defined

$$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})}{q(\mathbf{Z})} \right\} \tag{15.53}$$
$$\mathrm{KL}(q \| p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})} \right\}. \tag{15.54}$$

Note that $\mathcal{L}(q, \boldsymbol{\theta})$ is a functional of the distribution $q(\mathbf{Z})$ and a function of the
parameters $\boldsymbol{\theta}$. It is worth studying carefully the forms of the expressions (15.53) and
(15.54), and in particular noting that they differ in sign and also that $\mathcal{L}(q, \boldsymbol{\theta})$ contains
the joint distribution of $\mathbf{X}$ and $\mathbf{Z}$ whereas $\mathrm{KL}(q \| p)$ contains the conditional
distribution of $\mathbf{Z}$ given $\mathbf{X}$. To verify the decomposition (15.52), we first make use of the
product rule of probability to give

$$\ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta}) = \ln p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}) + \ln p(\mathbf{X}|\boldsymbol{\theta}), \tag{15.55}$$

which we then substitute into the expression for $\mathcal{L}(q, \boldsymbol{\theta})$. This gives rise to two terms,
one of which cancels $\mathrm{KL}(q \| p)$ whereas the other gives the required log likelihood
$\ln p(\mathbf{X}|\boldsymbol{\theta})$ after noting that $q(\mathbf{Z})$ is a normalized distribution that sums to 1.

Figure 15.13 Illustration of the decomposition given by (15.52), which holds for
any choice of distribution $q(\mathbf{Z})$. Because the Kullback-Leibler divergence satisfies
$\mathrm{KL}(q \| p) \geq 0$, we see that the quantity $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound on the log
likelihood function $\ln p(\mathbf{X}|\boldsymbol{\theta})$.

Figure 15.14 Illustration of the E step of the EM algorithm. The $q$ distribution is set
equal to the posterior distribution for the current parameter values $\boldsymbol{\theta}^{\text{old}}$, causing the
lower bound to move up to the same value as the log likelihood function, with the KL
divergence vanishing.

Section 2.5.7

From (15.54), we see that $\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence between
$q(\mathbf{Z})$ and the posterior distribution $p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})$. Recall that the Kullback-Leibler
divergence satisfies $\mathrm{KL}(q \| p) \geq 0$, with equality if, and only if, $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})$. It
therefore follows from (15.52) that $\mathcal{L}(q, \boldsymbol{\theta}) \leq \ln p(\mathbf{X}|\boldsymbol{\theta})$, in other words that $\mathcal{L}(q, \boldsymbol{\theta})$
is a lower bound on $\ln p(\mathbf{X}|\boldsymbol{\theta})$. The decomposition (15.52) is illustrated in
Figure 15.13.
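
The decomposition (15.52) is easy to verify numerically for a small discrete latent variable. In this sketch the joint probabilities, the choice of $q$, and the helper names `elbo` and `kl` are arbitrary illustrative assumptions.

```python
import numpy as np

# Toy values of p(X, Z = z | theta) for one fixed observation X and a latent
# variable Z with three states; the numbers are arbitrary.
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()              # p(X | theta), as in (15.51)
posterior = joint / evidence        # p(Z | X, theta)

def elbo(q):
    """Lower bound L(q, theta) from (15.53)."""
    return np.sum(q * np.log(joint / q))

def kl(q):
    """KL(q || p) between q(Z) and the posterior, from (15.54)."""
    return -np.sum(q * np.log(posterior / q))

q = np.array([0.2, 0.5, 0.3])       # any normalized q(Z) with full support

gap = np.log(evidence) - elbo(q)    # equals kl(q) by (15.52)
```

Setting `q` equal to `posterior` makes the KL term vanish, so that the bound touches the log likelihood.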


15.4.1 EM revisited 


We can use the decomposition (15.52) to derive the EM algorithm and to
demonstrate that it does indeed maximize the log likelihood. Suppose that the current value
of the parameter vector is $\boldsymbol{\theta}^{\text{old}}$. In the E step, the lower bound $\mathcal{L}(q, \boldsymbol{\theta}^{\text{old}})$ is
maximized with respect to $q(\mathbf{Z})$ while holding $\boldsymbol{\theta}^{\text{old}}$ fixed. The solution to this
maximization problem is easily seen by noting that the value of $\ln p(\mathbf{X}|\boldsymbol{\theta}^{\text{old}})$ does not depend
on $q(\mathbf{Z})$ and so the largest value of $\mathcal{L}(q, \boldsymbol{\theta}^{\text{old}})$ will occur when the Kullback-Leibler
divergence vanishes, in other words when $q(\mathbf{Z})$ is equal to the posterior
distribution $p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. In this case, the lower bound will equal the log likelihood, as
illustrated in Figure 15.14.

In the subsequent M step, the distribution $q(\mathbf{Z})$ is held fixed and the lower bound
$\mathcal{L}(q, \boldsymbol{\theta})$ is maximized with respect to $\boldsymbol{\theta}$ to give some new value $\boldsymbol{\theta}^{\text{new}}$. This will
cause the lower bound $\mathcal{L}$ to increase (unless it is already at a maximum), which will
necessarily cause the corresponding log likelihood function to increase. Because the
distribution $q$ is determined using the old parameter values rather than the new values
and is held fixed during the M step, it will not equal the new posterior distribution
$p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{new}})$, and hence there will be a non-zero Kullback-Leibler divergence. The
increase in the log likelihood function is therefore greater than the increase in the
lower bound, as shown in Figure 15.15. If we substitute $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}})$ into
(15.53), we see that, after the E step, the lower bound takes the form

$$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta}) - \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{\text{old}}) = \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) + \text{const} \tag{15.56}$$

where the constant is simply the entropy of the $q$ distribution and is therefore
independent of $\boldsymbol{\theta}$. Here we recognize $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ as the expected complete-data log
likelihood defined by (15.23), and it is therefore the quantity that is being maximized
in the M step, as we saw earlier for mixtures of Gaussians. Note that
the variable $\boldsymbol{\theta}$ over which we are optimizing appears only inside the logarithm. If
the joint distribution $p(\mathbf{Z}, \mathbf{X}|\boldsymbol{\theta})$ is a member of the exponential family or a product
of such members, then we see that the logarithm will cancel the exponential and
lead to an M step that will be typically much simpler than the maximization of the
corresponding incomplete-data log likelihood function $\ln p(\mathbf{X}|\boldsymbol{\theta})$.

Figure 15.15 Illustration of the M step of the EM algorithm. The distribution $q(\mathbf{Z})$ is
held fixed and the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ is maximized with respect to the parameter
vector $\boldsymbol{\theta}$ to give a revised value $\boldsymbol{\theta}^{\text{new}}$. Because the Kullback-Leibler divergence is
non-negative, this causes the log likelihood $\ln p(\mathbf{X}|\boldsymbol{\theta})$ to increase by at least as much
as the lower bound does.

The operation of the EM algorithm can also be viewed in the space of parameters,
as illustrated schematically in Figure 15.16. Here the red curve depicts the
(incomplete-data) log likelihood function whose value we wish to maximize. We
start with some initial parameter value $\boldsymbol{\theta}^{(\text{old})}$, and in the first E step we evaluate the
posterior distribution over latent variables, which gives rise to a lower bound $\mathcal{L}(q, \boldsymbol{\theta})$
whose value equals the log likelihood at $\boldsymbol{\theta}^{(\text{old})}$, as shown by the blue curve. Note that
the bound makes a tangential contact with the log likelihood at $\boldsymbol{\theta}^{(\text{old})}$, so that both
curves have the same gradient. This bound is a concave function having a unique
maximum (for mixture components from the exponential family). In the M step,
the bound is maximized giving the value $\boldsymbol{\theta}^{(\text{new})}$, which gives a larger value of the
log likelihood than $\boldsymbol{\theta}^{(\text{old})}$. The subsequent E step then constructs a bound that is
tangential at $\boldsymbol{\theta}^{(\text{new})}$ as shown by the green curve.

Exercise 15.22

Figure 15.16 The EM algorithm involves alternately computing a lower bound on the
log likelihood for the current parameter values and then maximizing this bound to
obtain the new parameter values. See the text for a full discussion.

We have seen that both the E and the M steps of the EM algorithm increase the 
value of a well-defined bound on the log likelihood function and that the complete 
EM cycle will change the model parameters in such a way as to cause the log like- 
lihood to increase (unless it is already at a maximum, in which case the parameters 
remain unchanged). 


15.4.2 Independent and identically distributed data 


For the particular case of an i.i.d. data set, $\mathbf{X}$ will comprise $N$ data points
$\{\mathbf{x}_n\}$ whereas $\mathbf{Z}$ will comprise $N$ corresponding latent variables $\{\mathbf{z}_n\}$, where
$n = 1, \ldots, N$. From the independence assumption, we have $p(\mathbf{X}, \mathbf{Z}) = \prod_n p(\mathbf{x}_n, \mathbf{z}_n)$,
and by marginalizing over the $\{\mathbf{z}_n\}$ we have $p(\mathbf{X}) = \prod_n p(\mathbf{x}_n)$. Using the sum
and product rules, we see that the posterior probability that is evaluated in the E step
takes the form

$$p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}) = \frac{p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})}{\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})} = \frac{\prod_{n=1}^{N} p(\mathbf{x}_n, \mathbf{z}_n|\boldsymbol{\theta})}{\sum_{\mathbf{Z}} \prod_{n=1}^{N} p(\mathbf{x}_n, \mathbf{z}_n|\boldsymbol{\theta})} = \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\theta}) \tag{15.57}$$

and so the posterior distribution also factorizes with respect to $n$. For a Gaussian
mixture model, this simply says that the responsibility that each of the mixture
components takes for a particular data point $\mathbf{x}_n$ depends only on the value of $\mathbf{x}_n$ and
on the parameters $\boldsymbol{\theta}$ of the mixture components, not on the values of the other data
points.

15.4.3 Parameter priors 


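We can also use the EM algorithm to maximize the posterior distribution $p(\boldsymbol{\theta}|\mathbf{X})$
for models in which we have introduced a prior $p(\boldsymbol{\theta})$ over the parameters. To see this,
note that as a function of $\boldsymbol{\theta}$, we have $p(\boldsymbol{\theta}|\mathbf{X}) = p(\boldsymbol{\theta}, \mathbf{X})/p(\mathbf{X})$ and so

$$\ln p(\boldsymbol{\theta}|\mathbf{X}) = \ln p(\boldsymbol{\theta}, \mathbf{X}) - \ln p(\mathbf{X}). \tag{15.58}$$

Making use of the decomposition (15.52), we have

$$\ln p(\boldsymbol{\theta}|\mathbf{X}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \| p) + \ln p(\boldsymbol{\theta}) - \ln p(\mathbf{X}) \geq \mathcal{L}(q, \boldsymbol{\theta}) + \ln p(\boldsymbol{\theta}) - \ln p(\mathbf{X}) \tag{15.59}$$

where $\ln p(\mathbf{X})$ is a constant. We can again optimize the right-hand side alternately
with respect to $q$ and $\boldsymbol{\theta}$. The optimization with respect to $q$ gives rise to the same
E-step equations as for the standard EM algorithm, because $q$ appears only in $\mathcal{L}(q, \boldsymbol{\theta})$.
The M-step equations are modified through the introduction of the prior term $\ln p(\boldsymbol{\theta})$,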
which typically requires only a small modification to the standard maximum likeli- 
hood M-step equations. The additional term represents a form of regularization and 
has the effect of removing the singularities of the likelihood function for Gaussian 
mixture models. 
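
As a small illustration of how a prior changes an M-step equation, consider placing a symmetric Dirichlet prior with concentration $\alpha$ on the mixing coefficients. This prior is an assumption introduced here for illustration and is not specified in the text. Maximizing $\sum_k N_k \ln \pi_k + (\alpha - 1) \sum_k \ln \pi_k$ subject to $\sum_k \pi_k = 1$ gives $\pi_k = (N_k + \alpha - 1)/(N + K(\alpha - 1))$ for $\alpha \geq 1$. The function names below are hypothetical.

```python
import numpy as np

def m_step_mixing_ml(gamma):
    """Maximum likelihood update: pi_k = N_k / N."""
    return gamma.sum(axis=0) / gamma.shape[0]

def m_step_mixing_map(gamma, alpha=2.0):
    """MAP update under an assumed symmetric Dirichlet(alpha) prior, alpha >= 1."""
    N, K = gamma.shape
    N_k = gamma.sum(axis=0)
    return (N_k + alpha - 1.0) / (N + K * (alpha - 1.0))

# Responsibilities for four points and three components; the third component
# currently explains no data at all.
gamma = np.array([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])

pi_ml = m_step_mixing_ml(gamma)
pi_map = m_step_mixing_map(gamma, alpha=2.0)
```

The maximum likelihood update drives the unused component's coefficient to zero, whereas the MAP update keeps it strictly positive. Note that this example only regularizes the mixing coefficients; removing the covariance singularities of a Gaussian mixture would require a prior over the covariances, which is not sketched here.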


15.4.4 Generalized EM 


The EM algorithm breaks down the potentially difficult problem of maximizing 
the likelihood function into two stages, the E step and the M step, each of which will 
often prove simpler to implement. Nevertheless, for complex models it may be the 
case that either the E step or the M step, or indeed both, remain intractable. This 
leads to two possible extensions of the EM algorithm, as follows. 

The generalized EM, or GEM, algorithm addresses the problem of an intractable
M step. Instead of aiming to maximize $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$, it seeks instead to
change the parameters in such a way as to increase its value. Again, because $\mathcal{L}(q, \boldsymbol{\theta})$
is a lower bound on the log likelihood function, each complete EM cycle of the 
GEM algorithm is guaranteed to increase the value of the log likelihood (unless the 
parameters already correspond to a local maximum). One way to exploit the GEM 
approach would be to use gradient-based iterative optimization algorithms during 
the M step. Another form of GEM algorithm, known as the expectation conditional 
maximization algorithm, involves making several constrained optimizations within 
each M step (Meng and Rubin, 1993). For instance, the parameters might be par- 
titioned into groups and the M step broken down into multiple steps each of which 
involves optimizing one of the groups with the remainder held fixed. 
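
A gradient-based partial M step can be sketched for the means of a mixture of spherical Gaussians with a fixed, shared variance. The setting, the function names `expected_ll_means` and `gem_mean_step`, the learning rate, and the numerical values are all assumptions for illustration. In this particular model the exact M step is available in closed form, which makes it convenient for checking that a single gradient step increases the bound without reaching its maximum.

```python
import numpy as np

def expected_ll_means(X, gamma, mu, var=1.0):
    """Part of Q that depends on the means, for covariance var * I."""
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (N, K)
    return -(gamma * sq).sum() / (2.0 * var)

def gem_mean_step(X, gamma, mu, lr=0.1, var=1.0):
    """One gradient-ascent step on Q with respect to the means (partial M step)."""
    N_k = gamma.sum(axis=0)
    # dQ/dmu_k = sum_n gamma_nk (x_n - mu_k) / var
    grad = (gamma.T @ X - N_k[:, None] * mu) / var
    return mu + lr * grad

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
gamma = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
mu = np.array([[2.0, 1.0], [3.0, -1.0]])

mu_gem = gem_mean_step(X, gamma, mu)                      # partial M step
mu_exact = gamma.T @ X / gamma.sum(axis=0)[:, None]       # full M step, for comparison

q_before = expected_ll_means(X, gamma, mu)
q_gem = expected_ll_means(X, gamma, mu_gem)
q_exact = expected_ll_means(X, gamma, mu_exact)
```

The step size must be small enough for the step to increase the objective; here the objective is quadratic in each mean with curvature $N_k / \text{var}$, so any `lr` below $2\,\text{var}/N_k$ suffices.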

Exercise 15.23

We can similarly generalize the E step of the EM algorithm by performing a
partial, rather than complete, optimization of $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $q(\mathbf{Z})$ (Neal and
Hinton, 1999). As we have seen, for any given value of $\boldsymbol{\theta}$ there is a unique maximum
of $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $q(\mathbf{Z})$ that corresponds to the posterior distribution
$q_{\boldsymbol{\theta}}(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})$ and that for this choice of $q(\mathbf{Z})$, the bound $\mathcal{L}(q, \boldsymbol{\theta})$ is equal to the log
likelihood function $\ln p(\mathbf{X}|\boldsymbol{\theta})$. It follows that any algorithm that converges to the
global maximum of $\mathcal{L}(q, \boldsymbol{\theta})$ will find a value of $\boldsymbol{\theta}$ that is also a global maximum
of the log likelihood $\ln p(\mathbf{X}|\boldsymbol{\theta})$. Provided $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$ is a continuous function of $\boldsymbol{\theta}$
then, by continuity, any local maximum of $\mathcal{L}(q, \boldsymbol{\theta})$ will also be a local maximum of
$\ln p(\mathbf{X}|\boldsymbol{\theta})$.


15.4.5 Sequential EM 


Consider $N$ independent data points $\mathbf{x}_1, \ldots, \mathbf{x}_N$ with corresponding latent
variables $\mathbf{z}_1, \ldots, \mathbf{z}_N$. The joint distribution $p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$ factorizes over the data points,
and this structure can be exploited in an incremental form of EM in which at each
EM cycle, only one data point is processed at a time. In the E step, instead of
recomputing the responsibilities for all the data points, we just re-evaluate the
responsibilities for one data point. It might appear that the subsequent M step would require
a computation involving the responsibilities for all the data points. However, if the
mixture components are members of the exponential family, then the responsibilities
enter only through simple sufficient statistics, and these can be updated efficiently.
Consider, for instance, a Gaussian mixture, and suppose we perform an update for
data point $m$ in which the corresponding old and new values of the responsibilities
are denoted by $\gamma^{\mathrm{old}}(z_{mk})$ and $\gamma^{\mathrm{new}}(z_{mk})$. In the M step, the required sufficient
statistics can be updated incrementally. For instance, for the means, the sufficient
statistics are defined by (15.16) and (15.17) from which we obtain (Exercise 15.23)


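$$\boldsymbol{\mu}_k^{\mathrm{new}} = \boldsymbol{\mu}_k^{\mathrm{old}} + \left(\frac{\gamma^{\mathrm{new}}(z_{mk}) - \gamma^{\mathrm{old}}(z_{mk})}{N_k^{\mathrm{new}}}\right)\left(\mathbf{x}_m - \boldsymbol{\mu}_k^{\mathrm{old}}\right) \tag{15.60}$$

together with

$$N_k^{\mathrm{new}} = N_k^{\mathrm{old}} + \gamma^{\mathrm{new}}(z_{mk}) - \gamma^{\mathrm{old}}(z_{mk}). \tag{15.61}$$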


The corresponding results for the covariances and the mixing coefficients are analo- 
gous. 

Thus, both the E step and the M step take a fixed time that is independent of
the total number of data points. Because the parameters are revised after each data
point, rather than waiting until after the whole data set is processed, this incremental
version can converge faster than the batch version. Each E or M step in this
incremental algorithm increases the value of $\mathcal{L}(q, \boldsymbol{\theta})$, and as we have shown above, if the
algorithm converges to a local (or global) maximum of $\mathcal{L}(q, \boldsymbol{\theta})$, this will correspond
to a local (or global) maximum of the log likelihood function $\ln p(\mathbf{X}|\boldsymbol{\theta})$.
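
The incremental updates (15.60) and (15.61) can be checked numerically against a full batch recomputation. The sketch below uses randomly generated responsibilities rather than a fitted mixture, and the function name `incremental_mean_update` is an assumption made for illustration.

```python
import numpy as np

def incremental_mean_update(mu_old, N_old, x_m, gamma_old_m, gamma_new_m):
    """Update all K means after the responsibilities of one data point change.

    mu_old: (K, D) means, N_old: (K,) effective counts,
    x_m: (D,) data point, gamma_old_m / gamma_new_m: (K,) responsibilities.
    """
    delta = gamma_new_m - gamma_old_m
    N_new = N_old + delta                                                   # (15.61)
    mu_new = mu_old + (delta / N_new)[:, None] * (x_m[None, :] - mu_old)    # (15.60)
    return mu_new, N_new

rng = np.random.default_rng(1)
N, D, K = 20, 3, 4
X = rng.normal(size=(N, D))
gamma = rng.dirichlet(np.ones(K), size=N)       # each row sums to one

# Batch sufficient statistics before the update
N_k = gamma.sum(axis=0)
mu = (gamma.T @ X) / N_k[:, None]

# Change the responsibilities of a single data point m
m = 7
gamma_new_m = rng.dirichlet(np.ones(K))
mu_inc, N_inc = incremental_mean_update(mu, N_k, X[m], gamma[m], gamma_new_m)

# Reference: recompute everything from scratch
gamma_batch = gamma.copy()
gamma_batch[m] = gamma_new_m
N_batch = gamma_batch.sum(axis=0)
mu_batch = (gamma_batch.T @ X) / N_batch[:, None]
```

The incremental update touches only `x_m` and the stored statistics, so its cost does not depend on `N`, whereas the reference computation sums over the whole data set.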


Exercises

15.1 (⋆) Consider the K-means algorithm discussed in Section 15.1. Show that as a
consequence of there being a finite number of possible assignments for the set of discrete
indicator variables $r_{nk}$, and that for each such assignment there is a unique optimum
for the $\{\boldsymbol{\mu}_k\}$, the K-means algorithm must converge after a finite number of iterations.

15.2 (⋆⋆) In this exercise we derive the sequential form for the K-means algorithm. At
each step we consider a new data point $\mathbf{x}_n$, and only the prototype vector that is
closest to $\mathbf{x}_n$ is updated. Starting from the expression (15.4) for the prototype vectors
in the batch setting, separate out the contribution from the final data point $\mathbf{x}_n$. By
rearranging the formula, show that this update takes the form (15.5). Note that, since
no approximation is made in this derivation, the resulting prototype vectors will have
the property that they each equal the mean of all the data vectors that were assigned
to them.

15.3 (⋆) Consider a Gaussian mixture model in which the marginal distribution $p(\mathbf{z})$ for
the latent variable is given by (15.9) and the conditional distribution $p(\mathbf{x}|\mathbf{z})$ for the
observed variable is given by (15.10). Show that the marginal distribution $p(\mathbf{x})$,
obtained by summing $p(\mathbf{z})p(\mathbf{x}|\mathbf{z})$ over all possible values of $\mathbf{z}$, is a Gaussian mixture
of the form (15.6).

15.4 (⋆) Show that the number of equivalent parameter settings due to interchange
symmetries in a mixture model with $K$ components is $K!$.

15.5 (⋆⋆) Suppose we wish to use the EM algorithm to maximize the posterior
distribution over parameters $p(\boldsymbol{\theta}|\mathbf{X})$ for a model containing latent variables, where $\mathbf{X}$ is
the observed data set. Show that the E step remains the same as in the maximum
likelihood case, whereas in the M step the quantity to be maximized is given by
$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}) + \ln p(\boldsymbol{\theta})$ where $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}})$ is defined by (15.23).

15.6 (⋆) Consider the directed graph for a Gaussian mixture model shown in Figure 15.9.
By making use of the d-separation criterion, show that the posterior distribution of
the latent variables factorizes with respect to the different data points so that

$$p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}). \tag{15.62}$$

15.7 (⋆⋆) Consider a special case of a Gaussian mixture model in which the covariance
matrices $\boldsymbol{\Sigma}_k$ of the components are all constrained to have a common value $\boldsymbol{\Sigma}$.
Derive the EM equations for maximizing the likelihood function under such a model.

15.8 (⋆⋆) Verify that maximization of the complete-data log likelihood (15.26) for a
Gaussian mixture model leads to the result that the means and covariances of each
component are fitted independently to the corresponding group of data points and
that the mixing coefficients are given by the fractions of points in each group.

15.9 (⋆⋆) Show that if we maximize (15.30) with respect to $\boldsymbol{\mu}_k$ while keeping the
responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed-form solution given by (15.16).

15.10 (⋆⋆) Show that if we maximize (15.30) with respect to $\boldsymbol{\Sigma}_k$ and $\pi_k$ while keeping the
responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed-form solutions given by (15.18)
and (15.21).
15.11 (⋆⋆) Consider a density model given by a mixture distribution:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k p(\mathbf{x}|k) \tag{15.63}$$

and suppose that we partition the vector $\mathbf{x}$ into two parts so that $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$.
Show that the conditional density $p(\mathbf{x}_b|\mathbf{x}_a)$ is itself a mixture distribution, and find
expressions for the mixing coefficients and for the component densities.

15.12 (⋆) In Section 15.3.2, we obtained a relationship between K-means and EM for
Gaussian mixtures by considering a mixture model in which all components have
covariance $\epsilon\mathbf{I}$. Show that in the limit $\epsilon \to 0$, maximizing the expected complete-data
log likelihood for this model, given by (15.30), is equivalent to minimizing the error
measure $J$ for the K-means algorithm given by (15.1).

15.13 (⋆⋆) Verify the results (15.35) and (15.36) for the mean and covariance of the Bernoulli
distribution.

15.14 (⋆⋆) Consider a mixture distribution of the form

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k p(\mathbf{x}|k) \tag{15.64}$$

where the elements of $\mathbf{x}$ could be discrete or continuous or a combination of these.
Denote the mean and covariance of $p(\mathbf{x}|k)$ by $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$, respectively. By making
use of the results of Exercise 15.13, show that the mean and covariance of the mixture
distribution are given by (15.39) and (15.40).

15.15 (⋆⋆) Using the re-estimation equations for the EM algorithm, show that a mixture
of Bernoulli distributions, with its parameters set to values corresponding to a
maximum of the likelihood function, has the property that

$$\mathbb{E}[\mathbf{x}] = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \equiv \bar{\mathbf{x}}. \tag{15.65}$$

Hence, show that if the parameters of this model are initialized such that all
components have the same mean $\boldsymbol{\mu}_k = \hat{\boldsymbol{\mu}}$ for $k = 1, \ldots, K$, then the EM algorithm will
converge after one iteration, for any choice of the initial mixing coefficients, and that
this solution has the property $\boldsymbol{\mu}_k = \bar{\mathbf{x}}$. Note that this represents a degenerate case of
the mixture model in which all the components are identical, and in practice we try
to avoid such solutions by using an appropriate initialization.

15.16 (⋆) Consider the joint distribution of latent and observed variables for the Bernoulli
distribution obtained by forming the product of $p(\mathbf{x}|\mathbf{z}, \boldsymbol{\mu})$ given by (15.42) and
$p(\mathbf{z}|\boldsymbol{\pi})$ given by (15.43). Show that if we marginalize this joint distribution with
respect to $\mathbf{z}$, then we obtain (15.37).
15.17 (⋆) Show that if we maximize the expected complete-data log likelihood function
(15.45) for a mixture of Bernoulli distributions with respect to $\boldsymbol{\mu}_k$, we obtain the
M-step equation (15.49).

15.18 (⋆) Show that if we maximize the expected complete-data log likelihood function
(15.45) for a mixture of Bernoulli distributions with respect to the mixing coefficients
$\pi_k$, and use a Lagrange multiplier to enforce the summation constraint, we obtain the
M-step equation (15.50).

15.19 (⋆) Show that as a consequence of the constraint $0 \leq p(\mathbf{x}_n|\boldsymbol{\mu}_k) \leq 1$ for the discrete
variable $\mathbf{x}_n$, the incomplete-data log likelihood function for a mixture of Bernoulli
distributions is bounded above and hence that there are no singularities for which the
likelihood goes to infinity.

15.20 (⋆⋆⋆) Consider a $D$-dimensional variable $\mathbf{x}$ each of whose components $i$ is itself a
multinomial variable of degree $M$ so that $\mathbf{x}$ is a binary vector with components $x_{ij}$
where $i = 1, \ldots, D$ and $j = 1, \ldots, M$, subject to the constraint that $\sum_j x_{ij} = 1$ for
all $i$. Suppose that the distribution of these variables is described by a mixture of the
discrete multinomial distributions so that

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k p(\mathbf{x}|\boldsymbol{\mu}_k) \tag{15.66}$$

where

$$p(\mathbf{x}|\boldsymbol{\mu}_k) = \prod_{i=1}^{D}\prod_{j=1}^{M} \mu_{kij}^{x_{ij}}. \tag{15.67}$$

The parameters $\mu_{kij}$ represent the probabilities $p(x_{ij} = 1|\boldsymbol{\mu}_k)$ and must satisfy
$0 \leq \mu_{kij} \leq 1$ together with the constraint $\sum_j \mu_{kij} = 1$ for all values of $k$ and $i$.
Given an observed data set $\{\mathbf{x}_n\}$, where $n = 1, \ldots, N$, derive the E-step and M-step
equations of the EM algorithm for optimizing the mixing coefficients $\pi_k$ and the
component parameters $\mu_{kij}$ of this distribution by maximum likelihood.

15.21 (⋆) Verify the relation (15.52) in which $\mathcal{L}(q, \boldsymbol{\theta})$ and $\mathrm{KL}(q\|p)$ are defined by (15.53)
and (15.54), respectively.

15.22 (⋆) Show that the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ given by (15.53), with $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{(\mathrm{old})})$,
has the same gradient with respect to $\boldsymbol{\theta}$ as the log likelihood function $\ln p(\mathbf{X}|\boldsymbol{\theta})$ at
the point $\boldsymbol{\theta} = \boldsymbol{\theta}^{(\mathrm{old})}$.

15.23 (⋆⋆) Consider the incremental form of the EM algorithm for a mixture of Gaussians,
in which the responsibilities are recomputed only for a specific data point $\mathbf{x}_m$.
Starting from the M-step formulae (15.16) and (15.17), derive the results (15.60) and
(15.61) for updating the component means.

15.24 (⋆⋆) Derive M-step formulae for updating the covariance matrices and mixing
coefficients in a Gaussian mixture model when the responsibilities are updated
incrementally, analogous to the result (15.60) for updating the means.


16 Continuous Latent Variables


In the previous chapter we discussed probabilistic models having discrete latent vari- 
ables, such as a mixture of Gaussians. We now explore models in which some, or 
all, of the latent variables are continuous. An important motivation for such models 
is that many data sets have the property that the data points lie close to a manifold 
of much lower dimensionality than that of the original data space. To see why this 
might arise, consider an artificial data set constructed by taking a handwritten digit 
from the MNIST data set (LeCun et al., 1998), represented by a 64 x 64 pixel grey- 
level image, and embedding it in a larger image of size 100 x 100 by padding with 
pixels having the value zero (corresponding to white pixels) in which the location and 
orientation of the digit are varied at random, as illustrated in Figure 16.1. Each of the 
resulting images is represented by a point in the 100 x 100 = 10,000-dimensional 
data space. However, across a data set of such images, there are only three degrees 
of freedom of variability, corresponding to vertical and horizontal translations and
rotations. The data points will therefore live on a subspace of the data space whose
intrinsic dimensionality is three. Note that the manifold will be nonlinear because,
for instance, if we translate the digit past a particular pixel, that pixel value will go
from zero (white) to one (black) and back to zero again, which is clearly a nonlinear
function of the digit position. In this example, the translation and rotation
parameters are latent variables because we observe only the image vectors and are not told
which values of the translation or rotation variables were used to create them.

Figure 16.1 A synthetic data set obtained by taking an image of a handwritten digit and creating multiple copies
in each of which the digit has undergone a random displacement and rotation within some larger image field.
The resulting images each have 100 × 100 = 10,000 pixels.

For real data sets of handwritten digits, there will be further degrees of freedom 
arising from scaling and other variations due, for example, to the variability in an 
individual’s writing as well as the differences in writing styles between individuals. 
Nevertheless, the number of such degrees of freedom will be small compared to the 
dimensionality of the data set. 

In practice, the data points will not be confined precisely to a smooth low- 
dimensional manifold, and we can interpret the departures of data points from the 
manifold as ‘noise’. This leads naturally to a generative view of such models in 
which we first select a point within the manifold according to some latent-variable 
distribution and then generate an observed data point by adding noise drawn from 
some conditional distribution of the data variables given the latent variables. 

The simplest continuous latent-variable model assumes Gaussian distributions 
for both the latent and observed variables and makes use of a linear-Gaussian de- 
pendence of the observed variables on the state of the latent variables. This leads 
to a probabilistic formulation of the well-known technique of principal component 
analysis (PCA) as well as to a related model called factor analysis. In this chapter
we will begin with a standard, non-probabilistic treatment of PCA (Section 16.1), and then
we show how PCA arises naturally as the maximum likelihood solution for a
linear-Gaussian latent-variable model (Section 16.2). This probabilistic reformulation brings many
advantages, such as the use of EM for parameter estimation, principled extensions to
mixtures of PCA models, and Bayesian formulations that allow the number of prin- 
cipal components to be determined automatically from the data (Bishop, 2006). This 
chapter also lays the foundations for nonlinear models having continuous latent vari- 
ables including normalizing flows, variational autoencoders, and diffusion models. 


Figure 16.2 Principal component analysis seeks a space of lower dimensionality, known as the
principal subspace and denoted by the magenta line, such that the orthogonal projection
of the data points (red dots) onto this subspace maximizes the variance of the
projected points (green dots). An alternative definition of PCA is based on minimizing
the sum-of-squares of the projection errors, indicated by the blue lines.


16.1. Principal Component Analysis


Principal component analysis, or PCA, is widely used for applications such as
dimensionality reduction, lossy data compression, feature extraction, and data
visualization (Jolliffe, 2002). It is also known as the Kosambi–Karhunen–Loève transform.

Consider the orthogonal projection of a data set onto a lower-dimensional lin- 
ear space, known as the principal subspace, as shown in Figure 16.2. PCA can be 
defined as the linear projection that maximizes the variance of the projected data 
(Hotelling, 1933). Equivalently, it can be defined as the linear projection that min- 
imizes the average projection cost, defined as the mean squared distance between 
the data points and their projections (Pearson, 1901). We consider each of these 
definitions in turn. 


16.1.1 Maximum variance formulation 


Consider a data set of observations $\{\mathbf{x}_n\}$ where $n = 1, \ldots, N$, and $\mathbf{x}_n$ is a
Euclidean variable with dimensionality D. Our goal is to project the data onto a 
space having dimensionality M < D while maximizing the variance of the projected 
data. For the moment, we will assume that the value of M is given. Later in this 
chapter, we will consider techniques to determine an appropriate value of M from 
the data. 

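To begin with, consider the projection onto a one-dimensional space ($M = 1$).
We can define the direction of this space using a $D$-dimensional vector $\mathbf{u}_1$, which
for convenience (and without loss of generality) we will choose to be a unit vector
so that $\mathbf{u}_1^{\mathrm{T}}\mathbf{u}_1 = 1$ (note that we are interested only in the direction defined by $\mathbf{u}_1$,
not in the magnitude of $\mathbf{u}_1$ itself). Each data point $\mathbf{x}_n$ is then projected onto a scalar
value $\mathbf{u}_1^{\mathrm{T}}\mathbf{x}_n$. The mean of the projected data is $\mathbf{u}_1^{\mathrm{T}}\bar{\mathbf{x}}$ where $\bar{\mathbf{x}}$ is the sample set mean
given by

$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \tag{16.1}$$

and the variance of the projected data is given by

$$\frac{1}{N}\sum_{n=1}^{N}\left\{\mathbf{u}_1^{\mathrm{T}}\mathbf{x}_n - \mathbf{u}_1^{\mathrm{T}}\bar{\mathbf{x}}\right\}^2 = \mathbf{u}_1^{\mathrm{T}}\mathbf{S}\mathbf{u}_1 \tag{16.2}$$

where $\mathbf{S}$ is the data covariance matrix defined by

$$\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^{\mathrm{T}}. \tag{16.3}$$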


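We now maximize the projected variance $\mathbf{u}_1^{\mathrm{T}}\mathbf{S}\mathbf{u}_1$ with respect to $\mathbf{u}_1$. Clearly, this has
to be a constrained maximization to prevent $\|\mathbf{u}_1\| \to \infty$. The appropriate constraint
comes from the normalization condition $\mathbf{u}_1^{\mathrm{T}}\mathbf{u}_1 = 1$. To enforce this constraint,
we introduce a Lagrange multiplier that we will denote by $\lambda_1$, and then make an
unconstrained maximization of

$$\mathbf{u}_1^{\mathrm{T}}\mathbf{S}\mathbf{u}_1 + \lambda_1\left(1 - \mathbf{u}_1^{\mathrm{T}}\mathbf{u}_1\right). \tag{16.4}$$

By setting the derivative with respect to $\mathbf{u}_1$ equal to zero, we see that this quantity
will have a stationary point when

$$\mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1, \tag{16.5}$$

which says that $\mathbf{u}_1$ must be an eigenvector of $\mathbf{S}$. If we left-multiply by $\mathbf{u}_1^{\mathrm{T}}$ and make
use of $\mathbf{u}_1^{\mathrm{T}}\mathbf{u}_1 = 1$, we see that the variance is given by

$$\mathbf{u}_1^{\mathrm{T}}\mathbf{S}\mathbf{u}_1 = \lambda_1 \tag{16.6}$$

and so the variance will be a maximum when we set $\mathbf{u}_1$ equal to the eigenvector
having the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal
component.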

We can define additional principal components in an incremental fashion by
choosing each new direction to be that which maximizes the projected variance
amongst all possible directions orthogonal to those already considered. If we
consider the general case of an $M$-dimensional projection space, the optimal linear
projection for which the variance of the projected data is maximized is now defined by
the $M$ eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_M$ of the data covariance matrix $\mathbf{S}$ corresponding to the
$M$ largest eigenvalues $\lambda_1, \ldots, \lambda_M$. This is easily shown using proof by induction.

To summarize, PCA involves evaluating the mean $\bar{\mathbf{x}}$ and the covariance matrix
$\mathbf{S}$ of a data set and then finding the $M$ eigenvectors of $\mathbf{S}$ corresponding to the $M$
largest eigenvalues. Algorithms for finding eigenvectors and eigenvalues, as well as
additional theorems related to eigenvector decomposition, can be found in Golub and
Van Loan (1996). Note that the computational cost of computing the full eigenvector
decomposition for a matrix of size $D \times D$ is $O(D^3)$. If we plan to project our
data onto the first $M$ principal components, then we only need to find the first $M$
eigenvalues and eigenvectors. This can be done with more efficient techniques, such
as the power method (Golub and Van Loan, 1996), that scale like $O(MD^2)$, or
alternatively we can make use of the EM algorithm.
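
The summary above can be illustrated with a short NumPy sketch. It uses a full eigendecomposition via `np.linalg.eigh` for simplicity (not the power method or EM mentioned in the text), synthetic data, and a hypothetical helper name `pca_eig`.

```python
import numpy as np

def pca_eig(X, M):
    """Return the mean, covariance S, and the M leading eigenvalues/eigenvectors of S."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # covariance with 1/N normalization, as in (16.3)
    eigvals, eigvecs = np.linalg.eigh(S)   # symmetric solver, eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]  # indices of the M largest eigenvalues
    return x_bar, S, eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
x_bar, S, lam, U = pca_eig(X, M=2)

# Variance of the data projected onto the first principal component, as in (16.2)
proj_var = np.var((X - x_bar) @ U[:, 0])   # np.var normalizes by N by default

# Variance along an arbitrary unit vector, for comparison
u_rand = rng.normal(size=5)
u_rand /= np.linalg.norm(u_rand)
rand_var = u_rand @ S @ u_rand
```

Here `proj_var` should agree with `lam[0]` as stated by (16.6), and `rand_var` cannot exceed it.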
16.1.2 Minimum-error formulation


We now discuss an alternative formulation of PCA based on projection error
minimization. To do this, we introduce a complete orthonormal set of $D$-dimensional
basis vectors $\{\mathbf{u}_i\}$ where $i = 1, \ldots, D$ that satisfy

$$\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j = \delta_{ij}. \tag{16.7}$$

Because this basis is complete, each data point can be represented exactly by a linear
combination of the basis vectors

$$\mathbf{x}_n = \sum_{i=1}^{D} \alpha_{ni}\mathbf{u}_i \tag{16.8}$$

where the coefficients $\alpha_{ni}$ will be different for different data points. This simply
corresponds to a rotation of the coordinate system to a new system defined by the
$\{\mathbf{u}_i\}$, and the original $D$ components $\{x_{n1}, \ldots, x_{nD}\}$ are replaced by an equivalent
set $\{\alpha_{n1}, \ldots, \alpha_{nD}\}$. Taking the inner product with $\mathbf{u}_j$, and making use of the
orthonormality property, we obtain $\alpha_{nj} = \mathbf{x}_n^{\mathrm{T}}\mathbf{u}_j$, and so without loss of generality we
can write

$$\mathbf{x}_n = \sum_{i=1}^{D}\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{u}_i\right)\mathbf{u}_i. \tag{16.9}$$
Our goal, however, is to approximate this data point using a representation
involving a restricted number $M < D$ of variables corresponding to a projection onto
a lower-dimensional subspace. The $M$-dimensional linear subspace can be
represented, without loss of generality, by the first $M$ of the basis vectors, and so we
approximate each data point $\mathbf{x}_n$ by

$$\tilde{\mathbf{x}}_n = \sum_{i=1}^{M} z_{ni}\mathbf{u}_i + \sum_{i=M+1}^{D} b_i\mathbf{u}_i \tag{16.10}$$

where the $\{z_{ni}\}$ depend on the particular data point, whereas the $\{b_i\}$ are constants
that are the same for all data points. We are free to choose the $\{\mathbf{u}_i\}$, the $\{z_{ni}\}$, and
the $\{b_i\}$ so as to minimize the error introduced by the reduction in dimensionality.
As our error measure, we will use the squared distance between the original data
point $\mathbf{x}_n$ and its approximation $\tilde{\mathbf{x}}_n$, averaged over the data set, so that our goal is to
minimize

$$J = \frac{1}{N}\sum_{n=1}^{N}\|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2. \tag{16.11}$$

Consider first the minimization with respect to the quantities $\{z_{ni}\}$. Substituting
for $\tilde{\mathbf{x}}_n$, setting the derivative with respect to $z_{nj}$ to zero, and making use of the
orthonormality conditions, we obtain

$$z_{nj} = \mathbf{x}_n^{\mathrm{T}}\mathbf{u}_j \tag{16.12}$$

where $j = 1, \ldots, M$. Similarly, setting the derivative of $J$ with respect to $b_j$ to zero
and again making use of the orthonormality relations, gives

$$b_j = \bar{\mathbf{x}}^{\mathrm{T}}\mathbf{u}_j \tag{16.13}$$

where $j = M + 1, \ldots, D$. If we substitute for $z_{ni}$ and $b_i$ and make use of the general
expansion (16.9), we obtain

$$\mathbf{x}_n - \tilde{\mathbf{x}}_n = \sum_{i=M+1}^{D}\left\{(\mathbf{x}_n - \bar{\mathbf{x}})^{\mathrm{T}}\mathbf{u}_i\right\}\mathbf{u}_i \tag{16.14}$$


from which we see that the displacement vector from $\mathbf{x}_n$ to $\tilde{\mathbf{x}}_n$ lies in the space
orthogonal to the principal subspace, because it is a linear combination of $\{\mathbf{u}_i\}$ for
$i = M + 1, \ldots, D$, as illustrated in Figure 16.2. This is to be expected because the
projected points $\tilde{\mathbf{x}}_n$ must lie within the principal subspace, but we can move them
freely within that subspace, and so the minimum error is given by the orthogonal
projection.

We therefore obtain an expression for the error measure $J$ as a function purely
of the $\{\mathbf{u}_i\}$ in the form

$$J = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=M+1}^{D}\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{u}_i - \bar{\mathbf{x}}^{\mathrm{T}}\mathbf{u}_i\right)^2 = \sum_{i=M+1}^{D}\mathbf{u}_i^{\mathrm{T}}\mathbf{S}\mathbf{u}_i. \tag{16.15}$$
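
The step from (16.14) to (16.15) can be spelled out as follows. This is a short worked derivation that uses only the orthonormality condition (16.7) and the definition (16.3) of the covariance matrix; no new assumptions are introduced.

```latex
\begin{align*}
J &= \frac{1}{N}\sum_{n=1}^{N}
     \Bigl\| \sum_{i=M+1}^{D}\bigl\{(\mathbf{x}_n-\bar{\mathbf{x}})^{\mathrm{T}}\mathbf{u}_i\bigr\}\mathbf{u}_i \Bigr\|^2
     && \text{substitute (16.14) into (16.11)} \\
  &= \frac{1}{N}\sum_{n=1}^{N}\sum_{i=M+1}^{D}
     \bigl\{(\mathbf{x}_n-\bar{\mathbf{x}})^{\mathrm{T}}\mathbf{u}_i\bigr\}^2
     && \text{cross terms vanish since } \mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j=\delta_{ij} \\
  &= \sum_{i=M+1}^{D}\mathbf{u}_i^{\mathrm{T}}
     \Bigl[\frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\bar{\mathbf{x}})(\mathbf{x}_n-\bar{\mathbf{x}})^{\mathrm{T}}\Bigr]\mathbf{u}_i
     = \sum_{i=M+1}^{D}\mathbf{u}_i^{\mathrm{T}}\mathbf{S}\mathbf{u}_i .
\end{align*}
```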


There remains the task of minimizing $J$ with respect to the $\{\mathbf{u}_i\}$, which must be
a constrained minimization otherwise we will obtain the vacuous result $\mathbf{u}_i = \mathbf{0}$. The
constraints arise from the orthonormality conditions, and as we will see, the solution
will be expressed in terms of the eigenvector expansion of the covariance matrix.
Before considering a formal solution, let us try to obtain some intuition about the
result by considering a two-dimensional data space $D = 2$ and a one-dimensional
principal subspace $M = 1$. We have to choose a direction $\mathbf{u}_2$ so as to minimize
$J = \mathbf{u}_2^{\mathrm{T}}\mathbf{S}\mathbf{u}_2$, subject to the normalization constraint $\mathbf{u}_2^{\mathrm{T}}\mathbf{u}_2 = 1$. Using a Lagrange
multiplier $\lambda_2$ to enforce the constraint, we consider the minimization of

$$\tilde{J} = \mathbf{u}_2^{\mathrm{T}}\mathbf{S}\mathbf{u}_2 + \lambda_2\left(1 - \mathbf{u}_2^{\mathrm{T}}\mathbf{u}_2\right). \tag{16.16}$$

Setting the derivative with respect to $\mathbf{u}_2$ to zero, we obtain $\mathbf{S}\mathbf{u}_2 = \lambda_2\mathbf{u}_2$ so that
$\mathbf{u}_2$ is an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda_2$. Thus, any eigenvector will define
a stationary point of the error measure. To find the value of $J$ at the minimum,
we back-substitute the solution for $\mathbf{u}_2$ into the error measure to give $J = \lambda_2$. We
therefore obtain the minimum value of $J$ by choosing $\mathbf{u}_2$ to be the eigenvector
corresponding to the smaller of the two eigenvalues. Thus, we should choose the principal
subspace to be aligned with the eigenvector having the larger eigenvalue. This result
accords with our intuition that, to minimize the average squared projection distance,
we should choose the principal component subspace so that it passes through the
mean of the data points and is aligned with the directions of maximum variance. If
the eigenvalues are equal, any choice of principal direction will give rise to the same
value of $J$.
The general solution to the minimization of $J$ for arbitrary $D$ and arbitrary $M < D$
is obtained by choosing the $\{\mathbf{u}_i\}$ to be eigenvectors of the covariance matrix given
by

$$\mathbf{S}\mathbf{u}_i = \lambda_i\mathbf{u}_i \tag{16.17}$$

where $i = 1, \ldots, D$, and as usual the eigenvectors $\{\mathbf{u}_i\}$ are chosen to be
orthonormal. The corresponding value of the error measure is then given by

$$J = \sum_{i=M+1}^{D}\lambda_i, \tag{16.18}$$

which is simply the sum of the eigenvalues of those eigenvectors that are orthogonal
to the principal subspace. We therefore obtain the minimum value of $J$ by selecting
these eigenvectors to be those having the $D - M$ smallest eigenvalues, and hence
the eigenvectors defining the principal subspace are those corresponding to the $M$
largest eigenvalues.

Although we have considered M < D, the PCA analysis still holds if M = 
D, in which case there is no dimensionality reduction but simply a rotation of the 
coordinate axes to align with the principal components. 

Finally, note that there is a related linear dimensionality reduction technique 
called canonical correlation analysis (Hotelling, 1936; Bach and Jordan, 2002). 
Whereas PCA works with a single random variable, canonical correlation analy- 
sis considers two (or more) variables and tries to find a corresponding pair of linear 
subspaces that have high cross-correlation, so that each component within one of the 
subspaces is correlated with a single component from the other subspace. Its solution 
can be expressed in terms of a generalized eigenvector problem. 


16.1.3 Data compression 


One application for PCA is data compression, and we can illustrate this by con- 
sidering a data set of images of handwritten digits. Because each eigenvector of the 
covariance matrix is a vector in the original D-dimensional space, we can represent 
the eigenvectors as images of the same size as the data points. The mean image and 
the first four eigenvectors, along with their corresponding eigenvalues, are shown in 
Figure 16.3. 

A plot of the complete spectrum of eigenvalues, sorted into decreasing order, is 
shown in Figure 16.4(a). The error measure J associated with choosing a particular 
value of M is given by the sum of the eigenvalues from M + 1 up to D and is plotted 
for different values of M in Figure 16.4(b). 

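Figure 16.3 Illustration of PCA applied to a data set of 6,000 images of size 28 × 28, each comprising a
handwritten image of the numeral '3', showing the mean vector $\bar{\mathbf{x}}$ along with the first four PCA eigenvectors
$\mathbf{u}_1, \ldots, \mathbf{u}_4$, together with their corresponding eigenvalues.

If we substitute (16.12) and (16.13) into (16.10), we can write the PCA
approximation to a data vector $\mathbf{x}_n$ in the form

$$\tilde{\mathbf{x}}_n = \sum_{i=1}^{M}\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{u}_i\right)\mathbf{u}_i + \sum_{i=M+1}^{D}\left(\bar{\mathbf{x}}^{\mathrm{T}}\mathbf{u}_i\right)\mathbf{u}_i \tag{16.19}$$

$$\phantom{\tilde{\mathbf{x}}_n} = \bar{\mathbf{x}} + \sum_{i=1}^{M}\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{u}_i - \bar{\mathbf{x}}^{\mathrm{T}}\mathbf{u}_i\right)\mathbf{u}_i \tag{16.20}$$

where we have made use of the relation

$$\bar{\mathbf{x}} = \sum_{i=1}^{D}\left(\bar{\mathbf{x}}^{\mathrm{T}}\mathbf{u}_i\right)\mathbf{u}_i, \tag{16.21}$$

which follows from the completeness of the $\{\mathbf{u}_i\}$. This represents a compression
of the data set, because for each data point we have replaced the $D$-dimensional
vector $\mathbf{x}_n$ with an $M$-dimensional vector having components $\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{u}_i - \bar{\mathbf{x}}^{\mathrm{T}}\mathbf{u}_i\right)$. The
smaller the value of $M$, the greater the degree of compression. Examples of PCA
reconstructions of data points for the digits data set are shown in Figure 16.5.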
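
The compression and reconstruction steps can be sketched in NumPy as below. The helper names `pca_fit`, `pca_compress`, and `pca_reconstruct` are hypothetical, and the data are synthetic rather than the digit images discussed in the text. The loop also records the sum of the discarded eigenvalues so that it can be compared with the measured reconstruction error, as (16.18) suggests.

```python
import numpy as np

def pca_fit(X, M):
    """Return the mean, all eigenvalues (descending), and the first M eigenvectors of S."""
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)   # bias=True gives the 1/N normalization of (16.3)
    lam, U = np.linalg.eigh(S)               # eigh returns eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]
    return x_bar, lam, U[:, :M]

def pca_compress(X, x_bar, U_M):
    # M numbers per data point: x_n^T u_i - x_bar^T u_i for i = 1, ..., M
    return (X - x_bar) @ U_M

def pca_reconstruct(Z, x_bar, U_M):
    # Equation (16.20): x_bar plus the weighted sum of the retained eigenvectors
    return x_bar + Z @ U_M.T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))   # synthetic correlated data

errors, discarded = [], []
for M in (2, 4, 8):
    x_bar, lam, U_M = pca_fit(X, M)
    Z = pca_compress(X, x_bar, U_M)
    X_rec = pca_reconstruct(Z, x_bar, U_M)
    errors.append(np.mean(np.sum((X - X_rec) ** 2, axis=1)))   # J from (16.11)
    discarded.append(lam[M:].sum())                             # right-hand side of (16.18)
```

Each data point is stored as the `M` entries of a row of `Z` (plus the shared `x_bar` and `U_M`), and the error falls to zero, up to rounding, when `M` equals the data dimensionality.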


16.1.4 Data whitening 


Another application of PCA is to data pre-processing. In this case, the goal is 
not dimensionality reduction but rather the transformation of a data set to standard- 
ize certain of its properties. This can be important in allowing subsequent machine 
learning algorithms to be applied successfully to the data set. Typically, it is done 
when the original variables are measured in various different units or have signif- 
icantly different variabilities. For instance in the Old Faithful data set, the time 
between eruptions is typically an order of magnitude greater than the duration of an 
eruption. When we applied the k-means algorithm to this data set, we first made a 
separate linear re-scaling of the individual variables such that each variable had zero
mean and unit variance. This is known as standardizing the data, and the covariance
matrix for the standardized data has components

$$\rho_{ij} = \frac{1}{N}\sum_{n=1}^{N}\frac{(x_{ni} - \bar{x}_i)}{\sigma_i}\frac{(x_{nj} - \bar{x}_j)}{\sigma_j} \tag{16.22}$$

where $\sigma_i$ is the standard deviation of $x_i$. This is known as the correlation matrix of
the original data and has the property that if two components $x_i$ and $x_j$ of the data
are perfectly correlated, then $\rho_{ij} = 1$, and if they are uncorrelated, then $\rho_{ij} = 0$.

Figure 16.4 (a) Plot of the eigenvalue spectrum for the data set of handwritten digits used in Figure 16.3.
(b) Plot of the sum of the discarded eigenvalues, which represents the sum-of-squares error $J$ introduced by
projecting the data onto a principal component subspace of dimensionality $M$.
However, using PCA we can make a more substantial normalization of the data
to give it zero mean and unit covariance, so that different variables become
decorrelated. To do this, we first write the eigenvector equation (16.17) in the form

$$\mathbf{S}\mathbf{U} = \mathbf{U}\mathbf{L} \tag{16.23}$$

where $\mathbf{L}$ is a $D \times D$ diagonal matrix with elements $\lambda_i$, and $\mathbf{U}$ is a $D \times D$
orthogonal matrix with columns given by $\mathbf{u}_i$. Then we define, for each data point $\mathbf{x}_n$, a
transformed value given by

$$\mathbf{y}_n = \mathbf{L}^{-1/2}\mathbf{U}^{\mathrm{T}}(\mathbf{x}_n - \bar{\mathbf{x}}) \tag{16.24}$$

where $\bar{\mathbf{x}}$ is the sample mean defined by (16.1). Clearly, the set $\{\mathbf{y}_n\}$ has zero mean,
and its covariance is given by the identity matrix because

$$\frac{1}{N}\sum_{n=1}^{N}\mathbf{y}_n\mathbf{y}_n^{\mathrm{T}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{L}^{-1/2}\mathbf{U}^{\mathrm{T}}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^{\mathrm{T}}\mathbf{U}\mathbf{L}^{-1/2} = \mathbf{L}^{-1/2}\mathbf{U}^{\mathrm{T}}\mathbf{S}\mathbf{U}\mathbf{L}^{-1/2} = \mathbf{L}^{-1/2}\mathbf{L}\mathbf{L}^{-1/2} = \mathbf{I}. \tag{16.25}$$

This operation is known as whitening or sphering the data and is illustrated for the
Old Faithful data set (Section 15.1) in Figure 16.6.

[Figure 16.5 panel titles include: Original, M = 250]
Figure 16.5 An example from the data set of handwritten digits together with its PCA reconstructions obtained
by retaining $M$ principal components for various values of $M$. As $M$ increases, the reconstruction becomes
more accurate and would become perfect when $M = D = 28 \times 28 = 784$.

Figure 16.6 Illustration of the effects of linear pre-processing applied to the Old Faithful data set. The plot on
the left shows the original data. The centre plot shows the result of standardizing the individual variables to zero
mean and unit variance. Also shown are the principal axes of this normalized data set, plotted over the range
$\pm\lambda_i^{1/2}$. The plot on the right shows the result of whitening the data to give it zero mean and unit covariance.
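
The whitening transformation (16.24) can be sketched in a few lines of NumPy. The function name `whiten` is hypothetical, the data are synthetic rather than the Old Faithful measurements, and the sketch assumes the covariance matrix has full rank so that every eigenvalue is strictly positive.

```python
import numpy as np

def whiten(X):
    """Return the whitened data, with one transformed point y_n per row."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]        # covariance with 1/N normalization, as in (16.3)
    lam, U = np.linalg.eigh(S)        # S U = U L, as in (16.23)
    # Row form of (16.24): y_n^T = (x_n - x_bar)^T U L^{-1/2}
    return (Xc @ U) / np.sqrt(lam)

rng = np.random.default_rng(0)
mix = np.array([[3.0, 0.0, 0.0],
                [1.0, 0.5, 0.0],
                [0.0, 2.0, 0.1]])     # arbitrary mixing matrix giving correlated variables
X = rng.normal(size=(1000, 3)) @ mix + 5.0

Y = whiten(X)
cov_Y = Y.T @ Y / Y.shape[0]
```

After the transformation the sample mean of `Y` is zero and `cov_Y` is the identity matrix up to rounding error, in line with (16.25).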


16.1.5 High-dimensional data 


In some applications of PCA, the number of data points is smaller than the di- 
mensionality of the data space. For example, we might want to apply PCA to a data 
set of a few hundred images, each of which corresponds to a vector in a space of po- 
tentially several million dimensions (corresponding to three colour values for each 
of the pixels in the image). Note that in a D-dimensional space, a set of N points, 
where $N < D$, defines a linear subspace whose dimensionality is at most $N - 1$, and
so there is little point in applying PCA for values of $M$ that are greater than $N - 1$.
Indeed, if we perform PCA we will find that at least $D - N + 1$ of the eigenvalues are
zero, corresponding to eigenvectors along whose directions the data set has zero
variance. Furthermore, typical algorithms for finding the eigenvectors of a $D \times D$ matrix
have a computational cost that scales like $O(D^3)$, and so for applications such as the
image example, a direct application of PCA will be computationally infeasible.

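We can resolve this problem as follows. First, let us define $\mathbf{X}$ to be the $(N \times D)$-dimensional
centred data matrix, whose $n$th row is given by $(\mathbf{x}_n - \bar{\mathbf{x}})^{\mathrm{T}}$. The covariance
matrix (16.3) can then be written as $\mathbf{S} = N^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{X}$, and the corresponding
eigenvector equation becomes

$$\frac{1}{N}\mathbf{X}^{\mathrm{T}}\mathbf{X}\mathbf{u}_i = \lambda_i\mathbf{u}_i. \tag{16.26}$$

Now pre-multiply both sides by $\mathbf{X}$ to give

$$\frac{1}{N}\mathbf{X}\mathbf{X}^{\mathrm{T}}(\mathbf{X}\mathbf{u}_i) = \lambda_i(\mathbf{X}\mathbf{u}_i). \tag{16.27}$$

If we now define $\mathbf{v}_i = \mathbf{X}\mathbf{u}_i$, we obtain

$$\frac{1}{N}\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{v}_i = \lambda_i\mathbf{v}_i, \tag{16.28}$$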


which is an eigenvector equation for the $N \times N$ matrix $N^{-1}\mathbf{X}\mathbf{X}^{\mathrm{T}}$. We see that this
has the same $N - 1$ eigenvalues as the original covariance matrix (which itself has an
additional $D - N + 1$ eigenvalues of value zero). Thus, we can solve the eigenvector
problem in spaces of lower dimensionality with computational cost $O(N^3)$ instead
of $O(D^3)$. To determine the eigenvectors, we multiply both sides of (16.28) by $\mathbf{X}^{\mathrm{T}}$
to give

$$\left(\frac{1}{N}\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)(\mathbf{X}^{\mathrm{T}}\mathbf{v}_i) = \lambda_i(\mathbf{X}^{\mathrm{T}}\mathbf{v}_i) \tag{16.29}$$

from which we see that $(\mathbf{X}^{\mathrm{T}}\mathbf{v}_i)$ is an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda_i$. Note,
however, that these eigenvectors are not necessarily normalized. To determine the
appropriate normalization, we re-scale $\mathbf{u}_i \propto \mathbf{X}^{\mathrm{T}}\mathbf{v}_i$ by a constant such that $\|\mathbf{u}_i\| = 1$,
which, assuming $\mathbf{v}_i$ has been normalized to unit length, gives

$$\mathbf{u}_i = \frac{1}{(N\lambda_i)^{1/2}}\mathbf{X}^{\mathrm{T}}\mathbf{v}_i. \tag{16.30}$$

In summary, to apply this approach we first evaluate $\mathbf{X}\mathbf{X}^{\mathrm{T}}$ and then find its
eigenvectors and eigenvalues and then compute the eigenvectors in the original data space
using (16.30).
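
The procedure summarized above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function name `pca_small_n` and the random test data are introduced here, and the sketch requires $M \leqslant N - 1$ so that the retained eigenvalues are non-zero.

```python
import numpy as np

def pca_small_n(X_raw, M):
    """Leading M eigenpairs of S = X^T X / N, computed via the N x N matrix X X^T / N."""
    N = X_raw.shape[0]
    X = X_raw - X_raw.mean(axis=0)         # centred data matrix, N x D
    K = X @ X.T / N                        # N x N matrix appearing in (16.28)
    lam, V = np.linalg.eigh(K)             # ascending eigenvalues, unit-length v_i
    order = np.argsort(lam)[::-1][:M]      # indices of the M largest eigenvalues
    lam, V = lam[order], V[:, order]
    U = X.T @ V / np.sqrt(N * lam)         # (16.30): u_i = X^T v_i / (N lam_i)^{1/2}
    return lam, U

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(20, 500))         # N = 20 points in D = 500 dimensions
lam, U = pca_small_n(X_raw, M=5)

# Check S u_i = lam_i u_i without ever forming the D x D matrix S
Xc = X_raw - X_raw.mean(axis=0)
S_times_U = Xc.T @ (Xc @ U) / X_raw.shape[0]
```

Here the eigen-decomposition is carried out on a $20 \times 20$ matrix rather than a $500 \times 500$ one, which is the source of the saving described in the text.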
16.2. Probabilistic Latent Variables

We have seen in the previous section that PCA can be defined in terms of a linear
projection of the data onto a subspace of lower dimensionality than the original data
space. Each data point projects to a unique value of the quantities $z_{nj}$ defined by
(16.12), and we can view these quantities as deterministic latent variables. To intro- 
duce and motivate probabilistic continuous latent variables, we now show that PCA 
can also be expressed as the maximum likelihood solution of a probabilistic latent- 
variable model. This reformulation of PCA, known as probabilistic PCA, has several 
advantages compared with conventional PCA: 


• A probabilistic PCA model represents a constrained form of a Gaussian distribution
  in which the number of free parameters can be restricted while still
  allowing the model to capture the dominant correlations in a data set.

• We can derive an EM algorithm for PCA that is computationally efficient in
  situations where only a few leading eigenvectors are required and that avoids
  having to evaluate the data covariance matrix as an intermediate step.

• The combination of a probabilistic model and EM allows us to deal with missing
  values in the data set.

• Mixtures of probabilistic PCA models can be formulated in a principled way
  and trained using the EM algorithm.

• The existence of a likelihood function allows direct comparison with other
  probabilistic density models. By contrast, conventional PCA will assign a low
  reconstruction cost to data points that are close to the principal subspace even
  if they lie arbitrarily far from the training data.

• Probabilistic PCA can be used to model class-conditional densities and hence
  be applied to classification problems.

• A probabilistic PCA model can be run generatively to provide samples from
  the distribution.

• Probabilistic PCA forms the basis for a Bayesian treatment of PCA in which
  the dimensionality of the principal subspace can be found automatically from
  the data (Bishop, 2006).


This formulation of PCA as a probabilistic model was proposed independently by 
Tipping and Bishop (1997; 1999) and by Roweis (1998). As we will see later, it is
closely related to factor analysis (Basilevsky, 1994). 


16.2.1 Generative model 


Probabilistic PCA is a simple example of the linear-Gaussian framework in
which all the marginal and conditional distributions are Gaussian. We can formulate
probabilistic PCA by first introducing an explicit $M$-dimensional latent variable
$\mathbf{z}$ corresponding to the principal-component subspace. Next we define a Gaussian
prior distribution p(z) over the latent variable, together with a Gaussian conditional 
distribution p(x|z) for the D-dimensional observed variable x conditioned on the 
value of the latent variable. Specifically, the prior distribution over z is given by a 
zero-mean unit-covariance Gaussian: 


$$p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0}, \mathbf{I}). \tag{16.31}$$


Similarly, the conditional distribution of the observed variable x, conditioned on the 
value of the latent variable z, is again Gaussian: 


$$p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}|\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}) \tag{16.32}$$


in which the mean of $\mathbf{x}$ is a general linear function of $\mathbf{z}$ governed by the $D \times M$
matrix $\mathbf{W}$ and the $D$-dimensional vector $\boldsymbol{\mu}$. Note that this factorizes with respect to
the elements of $\mathbf{x}$. In other words this is an example of a naive Bayes model. As
we will see shortly, the columns of $\mathbf{W}$ span a linear subspace within the data space
that corresponds to the principal subspace. The other parameter in this model is the
scalar $\sigma^2$ governing the variance of the conditional distribution. Note that there is no
loss of generality in assuming a zero-mean unit-covariance Gaussian for the latent 
distribution p(z) because a more general Gaussian distribution would give rise to an 
equivalent probabilistic model. 
We can view the probabilistic PCA model from a generative viewpoint in which 
a sampled value of the observed variable is obtained by first choosing a value for 
the latent variable and then sampling the observed variable conditioned on this latent 
value. Specifically, the D-dimensional observed variable x is defined by a linear 
transformation of the M-dimensional latent variable z plus additive Gaussian noise, 
so that 
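$$\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon} \tag{16.33}$$

where $\mathbf{z}$ is an $M$-dimensional Gaussian latent variable, and $\boldsymbol{\epsilon}$ is a $D$-dimensional
zero-mean Gaussian-distributed noise variable with covariance $\sigma^2\mathbf{I}$. This generative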
process is illustrated in Figure 16.7. Note that this framework is based on a mapping 
from latent space to data space, in contrast to the more conventional view of PCA 
discussed above. The reverse mapping, from data space to the latent space, will be 
obtained shortly using Bayes’ theorem. 


16.2.2 Likelihood function 


Suppose we wish to determine the values of the parameters $\mathbf{W}$, $\boldsymbol{\mu}$, and $\sigma^2$ using
maximum likelihood. To write down the likelihood function, we need an expression 
for the marginal distribution p(x) of the observed variable. This is expressed, from 
the sum and product rules of probability, in the form 


$$p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,\mathrm{d}\mathbf{z}. \tag{16.34}$$


Figure 16.7 An illustration of the generative view of a probabilistic PCA model for a two-dimensional data
space and a one-dimensional latent space. An observed data point $\mathbf{x}$ is generated by first drawing a value $\hat{z}$
for the latent variable from its prior distribution $p(z)$ and then drawing a value for $\mathbf{x}$ from an isotropic Gaussian
distribution (illustrated by the red circles) having mean $\mathbf{w}\hat{z} + \boldsymbol{\mu}$ and covariance $\sigma^2\mathbf{I}$. The green ellipses show the
density contours for the marginal distribution $p(\mathbf{x})$.

Because this corresponds to a linear-Gaussian model, this marginal distribution is
again Gaussian, and is given by

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{C}) \tag{16.35}$$

where the $D \times D$ covariance matrix $\mathbf{C}$ is defined by

$$\mathbf{C} = \mathbf{W}\mathbf{W}^{\mathrm{T}} + \sigma^2\mathbf{I}. \tag{16.36}$$


This result can also be derived more directly by noting that the predictive distribution 
will be Gaussian and then evaluating its mean and covariance using (16.33). This 
gives 


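\begin{align}
\mathbb{E}[\mathbf{x}] &= \mathbb{E}[\mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}] = \boldsymbol{\mu} \tag{16.37} \\
\operatorname{cov}[\mathbf{x}] &= \mathbb{E}\left[(\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon})(\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon})^{\mathrm{T}}\right] \notag \\
&= \mathbb{E}\left[\mathbf{W}\mathbf{z}\mathbf{z}^{\mathrm{T}}\mathbf{W}^{\mathrm{T}}\right] + \mathbb{E}\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^{\mathrm{T}}\right] \tag{16.38} \\
&= \mathbf{W}\mathbf{W}^{\mathrm{T}} + \sigma^2\mathbf{I} \tag{16.39}
\end{align}

where we have used the fact that $\mathbf{z}$ and $\boldsymbol{\epsilon}$ are independent random variables and hence
are uncorrelated.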
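
The generative process (16.33) and the moments (16.37) to (16.39) can be checked by ancestral sampling. The following is a small NumPy sketch; the particular values chosen for `W`, `mu`, and `sigma2` are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 3, 2, 200_000

# Illustrative parameter values (not taken from the text)
W = np.array([[1.0, 0.0], [0.5, 1.0], [-0.3, 0.8]])   # D x M
mu = np.array([1.0, -1.0, 2.0])
sigma2 = 0.25

z = rng.normal(size=(N, M))                            # z ~ N(0, I), see (16.31)
eps = np.sqrt(sigma2) * rng.normal(size=(N, D))        # eps ~ N(0, sigma^2 I)
x = z @ W.T + mu + eps                                 # (16.33), one sample per row

C = W @ W.T + sigma2 * np.eye(D)                       # marginal covariance (16.36)
emp_mean = x.mean(axis=0)                              # Monte Carlo estimate of E[x]
emp_cov = np.cov(x, rowvar=False)                      # Monte Carlo estimate of cov[x]
```

With this many samples, `emp_mean` and `emp_cov` should agree with `mu` and `C` to within sampling error.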

Intuitively, we can think of the distribution p(x) as being defined by taking an 
isotropic Gaussian ‘spray can’ and moving it across the principal subspace spraying 
Gaussian ink with density determined by $\sigma^2$ and weighted by the prior distribution.
The accumulated ink density gives rise to a ‘pancake’ shaped distribution represent- 
ing the marginal density p(x). 

Figure 16.8 The probabilistic PCA model for a data set of $N$ observations

The predictive distribution $p(\mathbf{x})$ is governed by the parameters $\boldsymbol{\mu}$, $\mathbf{W}$, and $\sigma^2$.
However, there is redundancy in this parameterization corresponding to rotations of
the latent space coordinates. To see this, consider a matrix $\widetilde{\mathbf{W}} = \mathbf{W}\mathbf{R}$ where $\mathbf{R}$ is
an orthogonal matrix. Using the orthogonality property $\mathbf{R}\mathbf{R}^{\mathrm{T}} = \mathbf{I}$, we see that the
quantity $\widetilde{\mathbf{W}}\widetilde{\mathbf{W}}^{\mathrm{T}}$ that appears in the covariance matrix $\mathbf{C}$ takes the form

$$\widetilde{\mathbf{W}}\widetilde{\mathbf{W}}^{\mathrm{T}} = \mathbf{W}\mathbf{R}\mathbf{R}^{\mathrm{T}}\mathbf{W}^{\mathrm{T}} = \mathbf{W}\mathbf{W}^{\mathrm{T}} \tag{16.40}$$

and hence is independent of $\mathbf{R}$. Thus, there is a whole family of matrices $\widetilde{\mathbf{W}}$ all of
which give rise to the same predictive distribution. This invariance can be understood
in terms of rotations within the latent space. We will return to a discussion of the
number of independent parameters in this model later.
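
The invariance (16.40) is easy to confirm numerically. In this short sketch the random matrix `W` and the way the orthogonal matrix `R` is generated (from a QR decomposition of a random matrix) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))                    # arbitrary D x M matrix
R, _ = np.linalg.qr(rng.normal(size=(2, 2)))   # a random M x M orthogonal matrix
W_tilde = W @ R                                # rotated parameters

lhs = W_tilde @ W_tilde.T                      # equals W W^T by (16.40)
rhs = W @ W.T
```

Although `W_tilde` and `W` generally differ, they give the same matrix $\mathbf{W}\mathbf{W}^{\mathrm{T}}$ and therefore the same covariance $\mathbf{C}$.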

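When we evaluate the predictive distribution, we require $\mathbf{C}^{-1}$, which involves
the inversion of a $D \times D$ matrix. The computation required to do this can be reduced
by making use of the matrix inversion identity (A.7) to give

$$\mathbf{C}^{-1} = \sigma^{-2}\mathbf{I} - \sigma^{-2}\mathbf{W}\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}} \tag{16.41}$$

where the $M \times M$ matrix $\mathbf{M}$ is defined by

$$\mathbf{M} = \mathbf{W}^{\mathrm{T}}\mathbf{W} + \sigma^2\mathbf{I}. \tag{16.42}$$

Because we invert $\mathbf{M}$ rather than inverting $\mathbf{C}$ directly, the cost of evaluating $\mathbf{C}^{-1}$ is
reduced from $O(D^3)$ to $O(M^3)$.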
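
As a sanity check on (16.41) and (16.42), the following NumPy sketch builds $\mathbf{C}^{-1}$ from the $M \times M$ system only and compares it against the definition of $\mathbf{C}$. The dimensions and random parameter values are assumptions for illustration, and the variable `M_mat` is named to avoid a clash with the latent dimensionality `M`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 50, 3
W = rng.normal(size=(D, M))                    # illustrative parameters
sigma2 = 0.5

C = W @ W.T + sigma2 * np.eye(D)               # (16.36), a D x D matrix
M_mat = W.T @ W + sigma2 * np.eye(M)           # (16.42), an M x M matrix

# (16.41): only the M x M system is solved; no D x D inversion is needed
C_inv = (np.eye(D) - W @ np.linalg.solve(M_mat, W.T)) / sigma2
```

The product `C_inv @ C` should equal the identity up to rounding error.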

As well as the predictive distribution p(x), we will also require the posterior 
distribution p(z|x), which can again be written down directly using the result (3.100) 
for linear-Gaussian models to give 


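$$p(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mathbf{z}|\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{x} - \boldsymbol{\mu}), \sigma^2\mathbf{M}^{-1}\right). \tag{16.43}$$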


Note that the posterior mean depends on x, whereas the posterior covariance is in- 
dependent of x. 


16.2.3 Maximum likelihood 


We next consider the determination of the model parameters using maximum 
likelihood. Given a data set $\mathbf{X} = \{\mathbf{x}_n\}$ of observed data points, the probabilistic
PCA model can be expressed as a directed graph, as shown in Figure 16.8. The 
corresponding log likelihood function is given, from (16.35), by 


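$$\begin{aligned}
\ln p(\mathbf{X}|\boldsymbol{\mu}, \mathbf{W}, \sigma^2) &= \sum_{n=1}^{N}\ln p(\mathbf{x}_n|\mathbf{W}, \boldsymbol{\mu}, \sigma^2) \\
&= -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\mathbf{C}| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu})^{\mathrm{T}}\mathbf{C}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}).
\end{aligned} \tag{16.44}$$

Setting the derivative of the log likelihood with respect to $\boldsymbol{\mu}$ equal to zero gives
the expected result $\boldsymbol{\mu} = \bar{\mathbf{x}}$ where $\bar{\mathbf{x}}$ is the data mean defined by (16.1). Because
the log likelihood is a quadratic function of $\boldsymbol{\mu}$, this solution represents the unique
maximum, as can be confirmed by computing second derivatives. Back-substituting,
we can then write the log likelihood function in the form

$$\ln p(\mathbf{X}|\mathbf{W}, \boldsymbol{\mu}, \sigma^2) = -\frac{N}{2}\left\{D\ln(2\pi) + \ln|\mathbf{C}| + \operatorname{Tr}\left(\mathbf{C}^{-1}\mathbf{S}\right)\right\} \tag{16.45}$$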


where S is the data covariance matrix defined by (16.3). 

Maximization with respect to $\mathbf{W}$ and $\sigma^2$ is more complex but nonetheless has
an exact closed-form solution. It was shown by Tipping and Bishop (1999) that all 
the stationary points of the log likelihood function can be written as 


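$$\mathbf{W}_{\mathrm{ML}} = \mathbf{U}_M(\mathbf{L}_M - \sigma^2\mathbf{I})^{1/2}\mathbf{R} \tag{16.46}$$

where $\mathbf{U}_M$ is a $D \times M$ matrix whose columns are given by any subset (of size $M$)
of the eigenvectors of the data covariance matrix $\mathbf{S}$. The $M \times M$ diagonal matrix
$\mathbf{L}_M$ has elements given by the corresponding eigenvalues $\lambda_i$, and $\mathbf{R}$ is an arbitrary
$M \times M$ orthogonal matrix.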

Furthermore, Tipping and Bishop (1999) showed that the maximum of the like- 
lihood function is obtained when the M eigenvectors are chosen to be those whose 
eigenvalues are the M largest (all other solutions being saddle points). A similar re- 
sult was conjectured independently by Roweis (1998), although no proof was given. 
Again, we will assume that the eigenvectors have been arranged in order of decreasing
values of the corresponding eigenvalues, so that the $M$ principal eigenvectors
are $\mathbf{u}_1, \ldots, \mathbf{u}_M$. In this case, the columns of $\mathbf{W}$ define the principal subspace of
standard PCA. The corresponding maximum likelihood solution for $\sigma^2$ is then given
by

$$\sigma^2_{\mathrm{ML}} = \frac{1}{D - M}\sum_{i=M+1}^{D}\lambda_i \tag{16.47}$$

so that $\sigma^2_{\mathrm{ML}}$ is the average variance associated with the discarded dimensions.

Because $\mathbf{R}$ is orthogonal, it can be interpreted as a rotation matrix in the $M$-dimensional
latent space. If we substitute the solution for $\mathbf{W}$ into the expression for
$\mathbf{C}$ and make use of the orthogonality property $\mathbf{R}\mathbf{R}^{\mathrm{T}} = \mathbf{I}$, we see that $\mathbf{C}$ is independent
of $\mathbf{R}$. This simply says that the predictive density is unchanged by rotations
in the latent space as discussed earlier. For the particular case $\mathbf{R} = \mathbf{I}$, we see that
the columns of $\mathbf{W}$ are the principal component eigenvectors scaled by the square roots
of the variance parameters $\lambda_i - \sigma^2$. The interpretation of these scaling factors is clear
once we recognize that for a convolution of independent Gaussian distributions (in this case the
latent space distribution and the noise model) the variances are additive. Thus, the
variance $\lambda_i$ in the direction of an eigenvector $\mathbf{u}_i$ is composed of the sum of a contribution
$\lambda_i - \sigma^2$ from the projection of the unit-variance latent space distribution into
data space through the corresponding column of $\mathbf{W}$ plus an isotropic contribution of
variance $\sigma^2$, which is added in all directions by the noise model.

It is worth taking a moment to study the form of the covariance matrix given
by (16.36). Consider the variance of the predictive distribution along some direction
specified by the unit vector $\mathbf{v}$, where $\mathbf{v}^{\mathrm{T}}\mathbf{v} = 1$, which is given by $\mathbf{v}^{\mathrm{T}}\mathbf{C}\mathbf{v}$. First
suppose that $\mathbf{v}$ is orthogonal to the principal subspace, in other words it is given by
some linear combination of the discarded eigenvectors. Then $\mathbf{v}^{\mathrm{T}}\mathbf{U}_M = \mathbf{0}$ and hence
$\mathbf{v}^{\mathrm{T}}\mathbf{C}\mathbf{v} = \sigma^2$. Thus, the model predicts a noise variance orthogonal to the principal
subspace, which from (16.47) is just the average of the discarded eigenvalues. Now
suppose that $\mathbf{v} = \mathbf{u}_i$ where $\mathbf{u}_i$ is one of the retained eigenvectors defining the principal
subspace. Then $\mathbf{v}^{\mathrm{T}}\mathbf{C}\mathbf{v} = (\lambda_i - \sigma^2) + \sigma^2 = \lambda_i$. In other words, this model
correctly captures the variance of the data along the principal axes and approximates
the variance in all remaining directions with a single average value $\sigma^2$.
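
The closed-form solution (16.46) and (16.47), together with the variance argument above, can be illustrated as follows. This is a sketch under stated assumptions: the function name `ppca_ml`, the choice $\mathbf{R} = \mathbf{I}$, and the synthetic data are introduced here, and the code requires $M < D$ so that at least one eigenvalue is discarded.

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum likelihood fit of probabilistic PCA, taking R = I."""
    N, D = X.shape
    mu = X.mean(axis=0)                        # maximum likelihood mean
    Xc = X - mu
    S = Xc.T @ Xc / N                          # data covariance matrix
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]             # sort into descending order
    sigma2 = lam[M:].mean()                    # (16.47): mean of discarded eigenvalues
    W = U[:, :M] * np.sqrt(lam[:M] - sigma2)   # (16.46) with R = I
    return mu, W, sigma2, lam, U

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) * np.array([3.0, 2.0, 1.0, 0.7, 0.5])
mu, W, sigma2, lam, U = ppca_ml(X, M=2)

C = W @ W.T + sigma2 * np.eye(5)               # model covariance (16.36)
# Variance v^T C v along retained and discarded eigenvector directions
var_retained = np.array([U[:, i] @ C @ U[:, i] for i in range(2)])
var_discarded = np.array([U[:, i] @ C @ U[:, i] for i in range(2, 5)])
```

Along the two retained directions the model variance reproduces the eigenvalues $\lambda_1$ and $\lambda_2$, whereas along each discarded direction it equals the single value `sigma2`.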

One way to construct the maximum likelihood density model would simply be 
to find the eigenvectors and eigenvalues of the data covariance matrix and then to 
evaluate $\mathbf{W}$ and $\sigma^2$ using the results given above. In this case, we would choose
R = I for convenience. However, if the maximum likelihood solution is found by 
numerical optimization of the likelihood function, for instance using an algorithm 
such as conjugate gradients (Fletcher, 1987; Nocedal and Wright, 1999) or through 
the EM algorithm, then the resulting value of R is essentially arbitrary. This implies 
that the columns of W need not be orthogonal. If an orthogonal basis is required, 
the matrix W can be post-processed appropriately (Golub and Van Loan, 1996). Al- 
ternatively, the EM algorithm can be modified in such a way as to yield orthonormal 
principal directions, sorted in descending order of the corresponding eigenvalues, 
directly (Ahn and Oh, 2003). 

The rotational invariance in latent space represents a form of statistical non- 
identifiability, analogous to that encountered for mixture models for discrete latent 
variables. Here there is a continuum of parameters, any value of which leads to the 
same predictive density, in contrast to the discrete non-identifiability associated with 
component relabelling in the mixture setting. 

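If we consider $M = D$, so that there is no reduction of dimensionality, then
$\mathbf{U}_M = \mathbf{U}$ and $\mathbf{L}_M = \mathbf{L}$. Making use of the orthogonality properties $\mathbf{U}\mathbf{U}^{\mathrm{T}} = \mathbf{I}$ and
$\mathbf{R}\mathbf{R}^{\mathrm{T}} = \mathbf{I}$, we see that the covariance $\mathbf{C}$ of the marginal distribution for $\mathbf{x}$ becomes

$$\mathbf{C} = \mathbf{U}(\mathbf{L} - \sigma^2\mathbf{I})^{1/2}\mathbf{R}\mathbf{R}^{\mathrm{T}}(\mathbf{L} - \sigma^2\mathbf{I})^{1/2}\mathbf{U}^{\mathrm{T}} + \sigma^2\mathbf{I} = \mathbf{U}\mathbf{L}\mathbf{U}^{\mathrm{T}} = \mathbf{S} \tag{16.48}$$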


and so we obtain the standard maximum likelihood solution for an unconstrained 
Gaussian distribution in which the covariance matrix is given by the sample covari- 
ance. 

Conventional PCA is generally formulated as a projection of points from the D- 
dimensional data space onto an M-dimensional linear subspace. Probabilistic PCA, 
however, is most naturally expressed as a mapping from the latent space into the data 
space via (16.33). For applications such as visualization and data compression, we 
can reverse this mapping using Bayes’ theorem. Any point x in data space can then 
be summarized by its posterior mean and covariance in latent space. From (16.43) 
the mean is given by 


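$$\mathbb{E}[\mathbf{z}|\mathbf{x}] = \mathbf{M}^{-1}\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}(\mathbf{x} - \bar{\mathbf{x}}) \tag{16.49}$$

where $\mathbf{M}$ is given by (16.42). This projects to a point in data space given by

$$\mathbf{W}\,\mathbb{E}[\mathbf{z}|\mathbf{x}] + \boldsymbol{\mu}. \tag{16.50}$$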


Note that this takes the same form as the equations for regularized linear regression 
and is a consequence of maximizing the likelihood function for a linear-Gaussian 
model. Similarly, from (16.43) the posterior covariance is given by $\sigma^2\mathbf{M}^{-1}$ and is
independent of $\mathbf{x}$.

If we take the limit $\sigma^2 \to 0$, then the posterior mean reduces to

$$(\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}\mathbf{W}_{\mathrm{ML}})^{-1}\mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}(\mathbf{x} - \bar{\mathbf{x}}), \tag{16.51}$$

which represents an orthogonal projection of the data point onto the latent space,
and so we recover the standard PCA model. The posterior covariance in this limit is
zero, however, and the density becomes singular. For $\sigma^2 > 0$, the latent projection
is shifted towards the origin, relative to the orthogonal projection.
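
The behaviour of the posterior mean as $\sigma^2$ varies can be seen in a small numerical sketch. The matrix `W`, the mean `mu`, and the data point `x` below are arbitrary illustrative values rather than a fitted maximum likelihood solution, and the least-squares solver is used here simply as a convenient way to evaluate the orthogonal projection (16.51).

```python
import numpy as np

def posterior_mean(x, W, mu, sigma2):
    """Posterior mean of z given x, following (16.49) with M from (16.42)."""
    M_mat = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M_mat, W.T @ (x - mu))

# Illustrative values (assumed, not from the text)
W = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mu = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])

z_noisy = posterior_mean(x, W, mu, sigma2=1.0)        # sigma^2 > 0
z_limit = posterior_mean(x, W, mu, sigma2=1e-12)      # approaching the limit sigma^2 -> 0
z_proj = np.linalg.lstsq(W, x - mu, rcond=None)[0]    # (W^T W)^{-1} W^T (x - mu), see (16.51)
```

In this example `z_limit` coincides with the orthogonal projection `z_proj`, while `z_noisy` has a smaller norm, reflecting the shift towards the origin.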

Finally, note that an important role for the probabilistic PCA model is in defin- 
ing a multivariate Gaussian distribution in which the number of degrees of freedom, 
in other words the number of independent parameters, can be controlled while still 
allowing the model to capture the dominant correlations in the data. Recall that a 
general Gaussian distribution has D(D + 1)/2 independent parameters in its co- 
variance matrix (plus another D parameters in its mean). Thus, the number of pa- 
rameters scales quadratically with D and can become excessive in spaces of high 
dimensionality. If we restrict the covariance matrix to be diagonal, then it has only 
D independent parameters, and so the number of parameters now grows linearly 
with dimensionality. However, it now treats the variables as if they were indepen- 
dent and hence can no longer express any correlations between them. Probabilistic 
PCA provides an elegant compromise in which the M most significant correlations 
can be captured while still ensuring that the total number of parameters grows only 
linearly with D. We can see this by evaluating the number of degrees of freedom in 
the probabilistic PCA model as follows. The covariance matrix C depends on the 
parameters $\mathbf{W}$, which has size $D \times M$, and $\sigma^2$, giving a total parameter count of
$DM + 1$. However, we have seen that there is some redundancy in this parameterization
associated with rotations of the coordinate system in the latent space. The
orthogonal matrix $\mathbf{R}$ that expresses these rotations has size $M \times M$. In the first
column of this matrix, there are $M - 1$ independent parameters, because the column
vector must be normalized to unit length. In the second column, there are $M - 2$
independent parameters, because the column must be normalized and also must be
orthogonal to the previous column, and so on. Summing this arithmetic series, we
see that $\mathbf{R}$ has a total of $M(M - 1)/2$ independent parameters. Thus, the number of
degrees of freedom in the covariance matrix $\mathbf{C}$ is given by

$$DM + 1 - M(M - 1)/2. \tag{16.52}$$

The number of independent parameters in this model therefore only grows linearly
with $D$, for fixed $M$. If we take $M = D - 1$, then we recover the standard result
for a full covariance Gaussian. In this case, the variance along $D - 1$ linearly independent
directions is controlled by the columns of $\mathbf{W}$, and the variance along the
remaining direction is given by $\sigma^2$. If $M = 0$, the model is equivalent to the isotropic
covariance case.
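
The parameter count (16.52) and its two limiting cases can be checked with a few lines of code. The function names below are hypothetical helpers introduced only for this illustration.

```python
def ppca_covariance_dof(D, M):
    """Degrees of freedom (16.52) of C = W W^T + sigma^2 I."""
    # M (M - 1) is always even, so integer division is exact
    return D * M + 1 - M * (M - 1) // 2

def full_covariance_dof(D):
    """Independent parameters in a general symmetric D x D covariance matrix."""
    return D * (D + 1) // 2

D = 10
dof_full_rank = ppca_covariance_dof(D, D - 1)   # M = D - 1 case
dof_isotropic = ppca_covariance_dof(D, 0)       # M = 0 case: only sigma^2 remains
dof_large = ppca_covariance_dof(1000, 2)        # linear growth in D for fixed M
```

For $D = 10$ the case $M = D - 1$ gives 55 parameters, matching $D(D + 1)/2$, and $M = 0$ leaves the single parameter $\sigma^2$.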


16.2.4 Factor analysis 


Factor analysis is a linear-Gaussian latent-variable model that is closely related 
to probabilistic PCA. Its definition differs from that of probabilistic PCA only in that 
the conditional distribution of the observed variable x given the latent variable z has 
a diagonal rather than an isotropic covariance so that 


$$p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}|\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \boldsymbol{\Psi}) \tag{16.53}$$

where $\boldsymbol{\Psi}$ is a $D \times D$ diagonal matrix. Note that the factor analysis model, in common
with probabilistic PCA, assumes that the observed variables $x_1, \ldots, x_D$ are
independent, given the latent variable $\mathbf{z}$. In essence, a factor analysis model explains
the observed covariance structure of the data by representing the independent variance
associated with each coordinate in the matrix $\boldsymbol{\Psi}$ and capturing the covariance
between variables in the matrix $\mathbf{W}$. In the factor analysis literature, the columns
of $\mathbf{W}$, which capture the correlations between observed variables, are called factor
loadings, and the diagonal elements of $\boldsymbol{\Psi}$, which represent the independent noise
variances for each of the variables, are called uniquenesses.

The origins of factor analysis are as old as those of PCA, and discussions of 
factor analysis can be found in the books by Everitt (1984), Bartholomew (1987), 
and Basilevsky (1994). Links between factor analysis and PCA were investigated 
by Lawley (1953) and Anderson (1963), who showed that at stationary points of 
the likelihood function, for a factor analysis model with $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$, the columns of
$\mathbf{W}$ are scaled eigenvectors of the sample covariance matrix and $\sigma^2$ is the average
of the discarded eigenvalues. Later, Tipping and Bishop (1999) showed that the 
maximum of the log likelihood function occurs when the eigenvectors comprising 
W are chosen to be the principal eigenvectors. 

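Making use of (16.34), we see that the marginal distribution for the observed
variable is given by $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{C})$ where now

$$\mathbf{C} = \mathbf{W}\mathbf{W}^{\mathrm{T}} + \boldsymbol{\Psi}. \tag{16.54}$$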


As with probabilistic PCA, this model is invariant to rotations in the latent space. 

Historically, factor analysis has been the subject of controversy when attempts 
have been made to place an interpretation on the individual factors (the coordinates 
in z-space), which has proven problematic due to the non-identifiability of factor 
analysis associated with rotations in this space. From our perspective, however, we 
shall view factor analysis as a form of latent-variable density model, in which the 
form of the latent space is of interest but not the particular choice of coordinates 
used to describe it. If we wish to remove the degeneracy associated with latent- 
space rotations, we must consider non-Gaussian latent-variable distributions, giving 
rise to independent component analysis models. 

Another difference between probabilistic PCA and factor analysis is their behaviour
under transformations of the data set. For PCA and probabilistic PCA, if we
rotate the coordinate system in data space, then we obtain exactly the same fit to the
data but with the $\mathbf{W}$ matrix transformed by the corresponding rotation matrix. However,
for factor analysis, the analogous property is that if we make a component-wise
re-scaling of the data vectors, then this is absorbed into a corresponding re-scaling
of the elements of $\boldsymbol{\Psi}$.


16.2.5 Independent component analysis 


One generalization of the linear-Gaussian latent-variable model is to consider 
models in which the observed variables are related linearly to the latent variables, 
but for which the latent distribution is non-Gaussian. An important class of such 
models, known as independent component analysis, or ICA, arises when we consider 
a distribution over the latent variables that factorizes, so that 


$$p(\mathbf{z}) = \prod_{j=1}^{M} p(z_j). \tag{16.55}$$


To understand the role of such models, consider a situation in which two people 
are talking at the same time, and we record their voices using two microphones. 
If we ignore effects such as time delay and echoes, then the signals received by 
the microphones at any point in time will be given by linear combinations of the 
amplitudes of the two voices. The coefficients of this linear combination will be 
constant, and if we can infer their values from sample data, then we can invert the 
mixing process (assuming it is non-singular) and thereby obtain two clean signals 
each of which contains the voice of just one person. This is an example of a problem 
called blind source separation in which ‘blind’ refers to the fact that we are given 
only the mixed data, and neither the original sources nor the mixing coefficients are 
observed (Cardoso, 1998). 

This type of problem is sometimes addressed using the following approach 
(MacKay, 2003) in which we ignore the temporal nature of the signals and treat the 
successive samples as i.i.d. We consider a generative model in which there are two 
latent variables corresponding to the unobserved speech signal amplitudes, and there 
are two observed variables given by the signal values at the microphones. The latent 
variables have a joint distribution that factorizes as above, and the observed variables 
are given by a linear combination of the latent variables. There is no need to include 
a noise distribution because the number of latent variables equals the number of ob- 
served variables, and therefore the marginal distribution of the observed variables 
will not in general be singular, so the observed variables are simply deterministic 
linear combinations of the latent variables. Given a data set of observations, the 
likelihood function for this model is a function of the coefficients in the linear com- 
bination. The log likelihood can be maximized using gradient-based optimization 
giving rise to a particular version of ICA. 

The success of this approach requires that the latent variables have non-Gaussian
distributions. To see this, recall that in probabilistic PCA (and in factor analysis) the
latent-space distribution is given by a zero-mean isotropic Gaussian. The model
therefore cannot distinguish between two different choices for the latent variables
if these differ simply by a rotation in latent space. This can be verified directly
by noting that the marginal density (16.35), and hence the likelihood function, is
unchanged if we make the transformation $\mathbf{W} \to \mathbf{W}\mathbf{R}$ where $\mathbf{R}$ is an orthogonal
matrix satisfying $\mathbf{R}\mathbf{R}^{\mathrm{T}} = \mathbf{I}$, because the matrix $\mathbf{C}$ given by (16.36) is itself invariant.
Extending the model to allow more general Gaussian latent distributions does not
change this conclusion because, as we have seen, such a model is equivalent to the
zero-mean isotropic Gaussian latent-variable model.

Another way to see why a Gaussian latent-variable distribution in a linear model 
is insufficient to find independent components is to note that the principal compo- 
nents represent a rotation of the coordinate system in data space so as to diagonalize 
the covariance matrix. The data distribution in the new coordinates is then uncorre- 
lated. Although zero correlation is a necessary condition for independence it is not, 
however, sufficient. In practice, a common choice for the latent-variable distribution 
is given by 

$$p(z_j) = \frac{1}{\pi\cosh(z_j)} = \frac{2}{\pi\left(e^{z_j} + e^{-z_j}\right)} \tag{16.56}$$


which has heavy tails compared to a Gaussian, reflecting the observation that many 
real-world distributions also exhibit this property. 
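
To make the blind source separation idea concrete, the following sketch mixes two synthetic heavy-tailed sources and recovers them by gradient-based maximization of the log likelihood under the latent density (16.56). Several choices here are assumptions made for illustration and are not taken from the text: the Laplace-distributed sources, the mixing matrix `A`, the step size and iteration count, and the use of a natural-gradient form of the update rather than the plain gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000
S = rng.laplace(size=(2, N))               # two independent heavy-tailed sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # illustrative mixing matrix (assumed)
X = A @ S                                  # noiseless mixtures, one column per sample

B = np.eye(2)                              # unmixing matrix, so that z = B x
for _ in range(500):
    Z = B @ X
    # Natural-gradient ascent on the log likelihood.
    # For the density (16.56), d/dz ln p(z) = -tanh(z).
    B += 0.1 * (np.eye(2) - np.tanh(Z) @ Z.T / N) @ B

# If separation succeeded, B A is close to a scaled permutation matrix
P = B @ A
ratios = np.abs(P).min(axis=1) / np.abs(P).max(axis=1)
```

Each row of `P` should then be dominated by a single entry, meaning that each recovered signal corresponds to one source up to scale and ordering, which cannot be determined by the model.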

The original ICA model (Bell and Sejnowski, 1995) was based on the optimiza- 
tion of an objective function defined by information maximization. One advantage 
of a probabilistic latent-variable formulation is that it helps to motivate and formu- 
late generalizations of basic ICA. For instance, independent factor analysis (Attias, 
1999) considers a model in which the number of latent and observed variables can 
differ, the observed variables are noisy, and the individual latent variables have flex- 
ible distributions modelled by mixtures of Gaussians. The log likelihood for this 
model is maximized using EM, and the reconstruction of the latent variables is ap- 
proximated using a variational approach. Many other types of model have been 
considered, and there is now a huge literature on ICA and its applications (Jutten 
and Herault, 1991; Comon, Jutten, and Herault, 1991; Amari, Cichocki, and Yang, 
1996; Pearlmutter and Parra, 1997; Hyvärinen and Oja, 1997; Hinton et al., 2001; 
Miskin and MacKay, 2001; Hojen-Sorensen, Winther, and Hansen, 2002; Choudrey 
and Roberts, 2003; Chan, Lee, and Sejnowski, 2003; Stone, 2004). 


16.2.6 Kalman filters 


So far we have assumed that the data values are i.i.d. A common situation in 
which this assumption does not hold is when the data points form an ordered se- 
quence. We have seen that a hidden Markov model can be viewed as an extension 
of the mixture models to allow for sequential correlations in the data. In a similar 
way, a continuous latent-variable model can be extended to handle sequential data 
by connecting the latent variables to form a Markov chain, as shown in the graph- 
ical model of Figure 16.9. This is known as a linear dynamical system or Kalman 
filter (Zarchan and Musoff, 2005). Note that this is the same graphical structure as 
a hidden Markov model. It is interesting to note that, historically, hidden Markov 
models and linear dynamical systems were developed independently. Once they are
both expressed as graphical models, however, the deep relationship between them
immediately becomes apparent. Kalman filters are widely used in many real-time
tracking applications, for example to track aircraft using radar reflections.

Figure 16.9 A probabilistic graphical model for sequential data, known as a linear dynamical
system, or Kalman filter, in which the latent variables form a Markov chain.

In the simplest such model, the distributions $p(\mathbf{x}_n|\mathbf{z}_n)$ in Figure 16.9 represent a
linear-Gaussian latent-variable model for that particular observation, of the kind we
have discussed previously for i.i.d. data. However, the latent variables $\{\mathbf{z}_n\}$ are no
longer treated as independent but now form a Markov chain in which the distribution
$p(\mathbf{z}_n|\mathbf{z}_{n-1})$ of each latent variable is conditioned on the state of the previous latent
variable in the chain. Again these can be chosen to be linear-Gaussian in which
the distribution of $\mathbf{z}_n$ is Gaussian with a mean given by a linear function of $\mathbf{z}_{n-1}$.
Typically the parameters of all the distributions $p(\mathbf{x}_n|\mathbf{z}_n)$ are shared, and likewise
the parameters of the distributions $p(\mathbf{z}_n|\mathbf{z}_{n-1})$ are shared, so that the total number
of parameters in the model is fixed, independently of the length of the sequence.
These parameters can be learned from data by maximum likelihood with efficient 
algorithms that involve propagating messages around the graph (Bishop, 2006). For 
the rest of this chapter, however, we will focus on i.i.d. data. 
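
Although we do not pursue sequential models further, the structure just described can be illustrated by sampling from a small linear dynamical system. The text does not specify a parameterization, so everything named here is an assumption: the transition matrix `A`, the transition noise covariance `Gamma`, the emission matrix `H`, the emission noise covariance `Sigma`, and the unit-Gaussian distribution for the first latent state.

```python
import numpy as np

def sample_lds(A, Gamma, H, Sigma, T, rng):
    """Sample latent states z_1..z_T and observations x_1..x_T from a linear-Gaussian chain."""
    K, D = A.shape[0], H.shape[0]
    z = np.zeros((T, K))
    x = np.zeros((T, D))
    z_n = rng.multivariate_normal(np.zeros(K), np.eye(K))   # assumed initial distribution
    for n in range(T):
        if n > 0:
            z_n = rng.multivariate_normal(A @ z_n, Gamma)   # p(z_n | z_{n-1})
        z[n] = z_n
        x[n] = rng.multivariate_normal(H @ z_n, Sigma)      # p(x_n | z_n)
    return z, x

rng = np.random.default_rng(0)
A = np.array([[0.99, 0.10], [-0.10, 0.99]])   # shared transition parameters
Gamma = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])                    # shared emission parameters
Sigma = np.array([[0.1]])
z, x = sample_lds(A, Gamma, H, Sigma, T=100, rng=rng)
```

Because the same four parameter arrays are reused at every step, the number of parameters does not depend on the sequence length `T`.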


16.3. Evidence Lower Bound


In our discussion of models with discrete latent variables, we derived the evidence 
lower bound (ELBO) on the marginal log likelihood and showed how this forms the 
basis for deriving the expectation–maximization (EM) algorithm including its generalizations
such as variational inference. The same framework applies to continuous
latent variables as well as to models that combine both discrete and continuous vari- 
ables. Here we present a slightly different derivation of the ELBO, and we assume 
that the latent variables z are continuous. 

Consider a model $p(x, z|w)$ with an observed variable $x$, a latent variable $z$, and
a learnable parameter vector $w$. If we introduce an arbitrary distribution $q(z)$ over
the latent variable then we can write the log likelihood function $\ln p(x|w)$ as a sum
of two terms in the form

$$\ln p(x|w) = \mathcal{L}(q, w) + \mathrm{KL}\left(q(z) \,\|\, p(z|x, w)\right) \tag{16.57}$$

where we have defined

$$\mathcal{L}(q, w) = \int q(z) \ln \left\{ \frac{p(x, z|w)}{q(z)} \right\} \mathrm{d}z \tag{16.58}$$

$$\mathrm{KL}\left(q(z) \,\|\, p(z|x, w)\right) = -\int q(z) \ln \left\{ \frac{p(z|x, w)}{q(z)} \right\} \mathrm{d}z. \tag{16.59}$$

Since $\mathrm{KL}(q(z) \,\|\, p(z|x, w))$ is a Kullback–Leibler divergence, it satisfies the
property $\mathrm{KL}(\cdot \,\|\, \cdot) \geq 0$ from which it follows that

$$\ln p(x|w) \geq \mathcal{L}(q, w) \tag{16.60}$$

and we therefore see that $\mathcal{L}(q, w)$ given by (16.58) forms a lower bound on the log
likelihood, known as the evidence lower bound or ELBO. We see that $\mathcal{L}(q, w)$ takes
the same form (15.53) as derived for the discrete case but with summations replaced
by integrals.
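The decomposition (16.57) can be checked numerically on a toy model. The following is a minimal sketch, assuming a hypothetical scalar linear-Gaussian model $p(z) = \mathcal{N}(z|0, 1)$, $p(x|z, w) = \mathcal{N}(x|wz, s^2)$ (the values of `w`, `s2`, and `x` are arbitrary choices, not taken from the text) and evaluating the integrals (16.58) and (16.59) by a simple sum over a dense grid.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# Hypothetical toy model: p(z) = N(z | 0, 1), p(x | z, w) = N(x | w z, s2).
w, s2, x = 1.5, 0.4, 0.8

# Dense grid over z used for simple numerical integration.
z = np.linspace(-12.0, 12.0, 200001)
dz = z[1] - z[0]

# For this model the posterior p(z | x, w) and the evidence are known exactly.
post_var = 1.0 / (1.0 + w ** 2 / s2)
post_mean = post_var * w * x / s2
log_evidence = log_normal(x, 0.0, w ** 2 + s2)

def elbo_and_kl(q_mean, q_var):
    """Return (L(q, w), KL(q || posterior)) for a Gaussian q(z)."""
    log_q = log_normal(z, q_mean, q_var)
    q = np.exp(log_q)
    log_joint = log_normal(x, w * z, s2) + log_normal(z, 0.0, 1.0)
    log_post = log_normal(z, post_mean, post_var)
    elbo = np.sum(q * (log_joint - log_q)) * dz      # (16.58)
    kl = -np.sum(q * (log_post - log_q)) * dz        # (16.59)
    return elbo, kl

# An arbitrary q(z), and the optimal choice q(z) = p(z | x, w).
elbo, kl = elbo_and_kl(q_mean=-0.3, q_var=0.7)
elbo_opt, kl_opt = elbo_and_kl(post_mean, post_var)
```

For any Gaussian `q` the two terms should sum to `log_evidence`, and when `q` equals the posterior the KL term should vanish so that the bound is tight.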

We can maximize the log likelihood function using a two-stage iterative procedure
called the expectation maximization algorithm, or EM algorithm, in which we
alternately maximize $\mathcal{L}(q, w)$ with respect to $q(z)$ (the E step) and $w$ (the M step).
We first initialize the parameters $w^{(\mathrm{old})}$. Then in the E step we keep $w$ fixed and
we maximize the lower bound with respect to $q(z)$. This is easily done by noting
that the highest value for the bound is obtained by minimizing the Kullback–Leibler
divergence in (16.59) and hence is achieved when $q(z) = p(z|x, w^{(\mathrm{old})})$, for which
the Kullback–Leibler divergence is zero. In the M step, we keep this choice of $q(z)$
fixed and maximize $\mathcal{L}(q, w)$ with respect to $w$. Substituting for $q(z)$ in (16.58) we
obtain

$$\begin{aligned}
\mathcal{L}(q, w) = {} & \int p(z|x, w^{(\mathrm{old})}) \ln p(x, z|w) \, \mathrm{d}z \\
& - \int p(z|x, w^{(\mathrm{old})}) \ln p(z|x, w^{(\mathrm{old})}) \, \mathrm{d}z.
\end{aligned} \tag{16.61}$$

We now maximize this with respect to $w$ in the M step while keeping $w^{(\mathrm{old})}$ fixed.
Note that the second term on the right-hand side of (16.61) is independent of $w$ and
so can be ignored during the M step. The first term on the right-hand side is the
expectation of the complete data log likelihood where the expectation is taken with
respect to the posterior distribution of $z$ computed using $w^{(\mathrm{old})}$.

If we have a data set $x_1, \ldots, x_N$ of i.i.d. observations then the log likelihood
function takes the form

$$\ln p(X|w) = \sum_{n=1}^{N} \ln p(x_n|w) \tag{16.62}$$

where the data matrix $X$ comprises $x_1, \ldots, x_N$, and the parameters $w$ are shared
across all data points. For each data point we introduce a corresponding latent
variable $z_n$ with its associated distribution $q(z_n)$, and by following similar steps to those
used to derive (16.58), we obtain the ELBO in the form

$$\mathcal{L}(q, w) = \sum_{n=1}^{N} \int q(z_n) \ln \left\{ \frac{p(x_n, z_n|w)}{q(z_n)} \right\} \mathrm{d}z_n. \tag{16.63}$$


When we discuss variational autoencoders, we will encounter a model for which 
an exact solution to the E step is not feasible so instead a partial maximization is 
performed by modelling q(z) using a deep neural network and then using the ELBO 
to learn the parameters of the network. 
16.3.1 Expectation maximization


We can now use the EM algorithm, derived by iteratively maximizing the ev- 
idence lower bound, to learn the parameters of the probabilistic PCA model. This 
may seem rather pointless because we have already obtained an exact closed-form 
solution for the maximum likelihood parameter values. However, in spaces of high 
dimensionality, there may be computational advantages in using an iterative EM 
procedure rather than working directly with the sample covariance matrix. This EM 
procedure can also be extended to the factor analysis model, for which there is no 
closed-form solution. Finally, it allows missing data to be handled in a principled 
way. 

We can derive the EM algorithm for probabilistic PCA by following the general
framework for EM. Thus, we write down the complete-data log likelihood and take
its expectation with respect to the posterior distribution of the latent variables
evaluated using ‘old’ parameter values. Maximization of this expected complete-data
log likelihood then yields the ‘new’ parameter values. Because the data points
are assumed independent, the complete-data log likelihood function takes the form

$$\ln p(X, Z|\mu, W, \sigma^2) = \sum_{n=1}^{N} \left\{ \ln p(x_n|z_n) + \ln p(z_n) \right\} \tag{16.64}$$

where the $n$th row of the matrix $Z$ is given by $z_n$. We already know that the exact
maximum likelihood solution for $\mu$ is given by the sample mean $\bar{x}$ defined by (16.1),
and it is convenient to substitute for $\mu$ at this stage. Making use of the expressions
(16.31) and (16.32) for the latent and conditional distributions, respectively, and
taking the expectation with respect to the posterior distribution over the latent variables,
we obtain

$$\begin{aligned}
\mathbb{E}[\ln p(X, Z|\mu, W, \sigma^2)] = -\sum_{n=1}^{N} \bigg\{ & \frac{D}{2} \ln(2\pi\sigma^2) + \frac{1}{2} \mathrm{Tr}\left(\mathbb{E}[z_n z_n^{\mathrm{T}}]\right) \\
& + \frac{1}{2\sigma^2} \|x_n - \mu\|^2 - \frac{1}{\sigma^2} \mathbb{E}[z_n]^{\mathrm{T}} W^{\mathrm{T}} (x_n - \mu) \\
& + \frac{1}{2\sigma^2} \mathrm{Tr}\left(\mathbb{E}[z_n z_n^{\mathrm{T}}] W^{\mathrm{T}} W\right) + \frac{M}{2} \ln(2\pi) \bigg\}.
\end{aligned} \tag{16.65}$$


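Note that this depends on the posterior distribution only through the sufficient
statistics of the Gaussian. Thus, in the E step, we use the old parameter values to evaluate

$$\mathbb{E}[z_n] = M^{-1} W^{\mathrm{T}} (x_n - \bar{x}) \tag{16.66}$$

$$\mathbb{E}[z_n z_n^{\mathrm{T}}] = \sigma^2 M^{-1} + \mathbb{E}[z_n] \mathbb{E}[z_n]^{\mathrm{T}}, \tag{16.67}$$

which follow directly from the posterior distribution (16.43) together with the
standard result $\mathbb{E}[z_n z_n^{\mathrm{T}}] = \mathrm{cov}[z_n] + \mathbb{E}[z_n] \mathbb{E}[z_n]^{\mathrm{T}}$. Here $M$ is defined by (16.42).

In the M step, we maximize with respect to $W$ and $\sigma^2$, keeping the posterior
statistics fixed. Maximization with respect to $\sigma^2$ is straightforward. For the
maximization with respect to $W$, we make use of (A.24) to obtain the M-step equations:

$$W_{\mathrm{new}} = \left[ \sum_{n=1}^{N} (x_n - \bar{x}) \mathbb{E}[z_n]^{\mathrm{T}} \right] \left[ \sum_{n=1}^{N} \mathbb{E}[z_n z_n^{\mathrm{T}}] \right]^{-1} \tag{16.68}$$

$$\sigma^2_{\mathrm{new}} = \frac{1}{ND} \sum_{n=1}^{N} \left\{ \|x_n - \bar{x}\|^2 - 2 \mathbb{E}[z_n]^{\mathrm{T}} W_{\mathrm{new}}^{\mathrm{T}} (x_n - \bar{x}) + \mathrm{Tr}\left(\mathbb{E}[z_n z_n^{\mathrm{T}}] W_{\mathrm{new}}^{\mathrm{T}} W_{\mathrm{new}}\right) \right\}. \tag{16.69}$$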


The EM algorithm for probabilistic PCA proceeds by initializing the parameters 
and then alternately computing the sufficient statistics of the latent space posterior 
distribution using (16.66) and (16.67) in the E step and revising the parameter values 
using (16.68) and (16.69) in the M step. 
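The E and M steps above translate almost line by line into matrix code. The following is a minimal NumPy sketch rather than a reference implementation: it assumes that $M$ in (16.42) is $W^{\mathrm{T}} W + \sigma^2 I$, and the function name `ppca_em`, the random initialization, the fixed iteration count, and the synthetic data set are all illustrative choices.

```python
import numpy as np

def ppca_em(X, M, n_iter=200, seed=0):
    """EM for probabilistic PCA; returns (x_bar, W, sigma2)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                      # rows are (x_n - x_bar)^T
    W = rng.normal(size=(D, M))         # arbitrary initialization (assumption)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E step: posterior statistics (16.66) and (16.67).
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ M_inv             # row n is E[z_n]^T (M_inv is symmetric)
        sum_Ezz = N * sigma2 * M_inv + Ez.T @ Ez   # sum over n of E[z_n z_n^T]
        # M step: parameter updates (16.68) and (16.69).
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(sum_Ezz @ W.T @ W)) / (N * D)
    return x_bar, W, sigma2

# Synthetic data from a hypothetical PPCA model with D = 5 and M = 2.
rng = np.random.default_rng(1)
W_true = np.array([[3.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.0, -1.0], [2.0, 0.0]])
Z = rng.normal(size=(500, 2))
X = Z @ W_true.T + rng.normal(size=(500, 5))      # unit noise variance
x_bar, W_em, sigma2_em = ppca_em(X, M=2, n_iter=500)
```

Because $W$ is only determined up to a rotation of the latent space, a sensible way to compare the result with the closed-form solution is through the model covariance $W W^{\mathrm{T}} + \sigma^2 I$ rather than through $W$ itself.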

One of the benefits of the EM algorithm for PCA is its computational efficiency 
for large-scale applications (Roweis, 1998). Unlike conventional PCA based on an 
eigenvector decomposition of the sample covariance matrix, the EM approach is 
iterative and so might appear to be less attractive. However, each cycle of the EM 
algorithm can be computationally much more efficient than conventional PCA in 
spaces of high dimensionality. To see this, note that the eigendecomposition of the
covariance matrix requires $O(D^3)$ computation. Often we are interested only in the
first $M$ eigenvectors and their corresponding eigenvalues, in which case we can use
algorithms that are $O(MD^2)$. However, evaluating the covariance matrix requires
$O(ND^2)$ computations, where $N$ is the number of data points. Algorithms such
as the snapshot method (Sirovich, 1987), which assume that the eigenvectors are
linear combinations of the data vectors, avoid a direct evaluation of the covariance
matrix but are $O(N^3)$ and hence unsuited to large data sets. The EM algorithm
described here also does not construct the covariance matrix explicitly. Instead, the
most computationally demanding steps are those involving sums over the data set
that are $O(NDM)$. For large $D$, and $M \ll D$, this can be a significant saving
compared to $O(ND^2)$ and can offset the iterative nature of the EM algorithm.

Note that this EM algorithm can be implemented in an online form in which 
each D-dimensional data point is read in and processed and then discarded before 
the next data point is considered. To see this, note that the quantities evaluated in 
the E step (an M-dimensional vector and an M x M matrix) can be computed for 
each data point separately, and in the M step we need to accumulate sums over data 
points, which we can do incrementally. This approach can be advantageous if both 
N and D are large. 
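One way to realize the online form is to accumulate the sums that appear in (16.68) and (16.69) while streaming over the data. The sketch below performs a single EM cycle in this way. It assumes that the sample mean `x_bar` is already available (for example from an earlier pass), that $M = W^{\mathrm{T}} W + \sigma^2 I$ as in (16.42), and it uses the identity $\sum_n \mathbb{E}[z_n]^{\mathrm{T}} W_{\mathrm{new}}^{\mathrm{T}} (x_n - \bar{x}) = \mathrm{Tr}(W_{\mathrm{new}}^{\mathrm{T}} A)$ with $A = \sum_n (x_n - \bar{x}) \mathbb{E}[z_n]^{\mathrm{T}}$; the function name is hypothetical.

```python
import numpy as np

def ppca_em_step_streaming(points, x_bar, W, sigma2):
    """One EM cycle for probabilistic PCA, reading one data point at a time."""
    D, M = W.shape
    M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
    A = np.zeros((D, M))     # running sum of (x_n - x_bar) E[z_n]^T
    B = np.zeros((M, M))     # running sum of E[z_n z_n^T]
    c = 0.0                  # running sum of ||x_n - x_bar||^2
    N = 0
    for x in points:         # each point can be discarded after this loop body
        d = x - x_bar
        Ez = M_inv @ W.T @ d                      # (16.66)
        A += np.outer(d, Ez)
        B += sigma2 * M_inv + np.outer(Ez, Ez)    # (16.67)
        c += d @ d
        N += 1
    W_new = A @ np.linalg.inv(B)                  # (16.68)
    sigma2_new = (c - 2.0 * np.sum(W_new * A)
                  + np.trace(B @ W_new.T @ W_new)) / (N * D)   # (16.69)
    return W_new, sigma2_new
```

Only the $D \times M$ matrix `A`, the $M \times M$ matrix `B`, and two scalars are kept in memory, so the storage cost is independent of $N$.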

Because we now have a fully probabilistic model for PCA, we can deal with 
missing data, provided that it is missing at random, in other words that the process 
that determines which values are missing does not depend on the values of any ob- 
served or unobserved variables. Such data sets can be handled by marginalizing over 
the distribution of the unobserved variables, and the resulting likelihood function can 
be maximized using the EM algorithm. 


16.3.2 EM for PCA 


Another elegant feature of the EM approach is that we can take the limit $\sigma^2 \to 0$,
corresponding to standard PCA, and still obtain a valid EM-like algorithm (Roweis,
1998). From (16.67), we see that the only quantity we need to compute in the E step
is $\mathbb{E}[z_n]$. Furthermore, the M step is simplified because $M = W^{\mathrm{T}} W$. To emphasize
the simplicity of the algorithm, let us define $\widetilde{X}$ to be a matrix of size $N \times D$ whose
$n$th row is given by the vector $x_n - \bar{x}$ and similarly define $\Omega$ to be a matrix of size
$M \times N$ whose $n$th column is given by the vector $\mathbb{E}[z_n]$. The E step (16.66) of the
EM algorithm for PCA then becomes

$$\Omega = (W_{\mathrm{old}}^{\mathrm{T}} W_{\mathrm{old}})^{-1} W_{\mathrm{old}}^{\mathrm{T}} \widetilde{X}^{\mathrm{T}} \tag{16.70}$$

and the M step (16.68) takes the form

$$W_{\mathrm{new}} = \widetilde{X}^{\mathrm{T}} \Omega^{\mathrm{T}} (\Omega \Omega^{\mathrm{T}})^{-1}. \tag{16.71}$$

Exercise 16.23

Section 16.2.4

Exercise 16.24



Again these can be implemented in an online form. These equations have a simple 
interpretation as follows. From our earlier discussion, we see that the E step involves 
an orthogonal projection of the data points onto the current estimate for the principal 
subspace. Correspondingly, the M step represents a re-estimation of the principal 
subspace to minimize the reconstruction error in which the projections are fixed. 
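
The two updates (16.70) and (16.71) can be written in a few lines. The following is a minimal sketch, assuming a random initial $W$ and a fixed number of iterations (both illustrative choices); the function name `em_pca` is hypothetical.

```python
import numpy as np

def em_pca(X, M, n_iter=100, seed=0):
    """Zero-noise EM for PCA; returns a D x M matrix spanning the principal subspace."""
    rng = np.random.default_rng(seed)
    Xt = X - X.mean(axis=0)                 # N x D matrix with rows x_n - x_bar
    W = rng.normal(size=(X.shape[1], M))    # arbitrary initialization (assumption)
    for _ in range(n_iter):
        # E step (16.70): project the data onto the current subspace, Omega is M x N.
        Omega = np.linalg.solve(W.T @ W, W.T @ Xt.T)
        # M step (16.71): re-estimate the subspace with the projections held fixed.
        W = Xt.T @ Omega.T @ np.linalg.inv(Omega @ Omega.T)
    return W
```

The columns of the returned $W$ span the principal subspace but are not, in general, orthonormal eigenvectors, so a comparison with conventional PCA is best made through the orthogonal projector $W (W^{\mathrm{T}} W)^{-1} W^{\mathrm{T}}$.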

We can give a simple physical analogy for this EM algorithm, which is easily 
visualized for D = 2 and M = 1. Consider a collection of data points in two 
dimensions, and let the one-dimensional principal subspace be represented by a solid 
rod. Now attach each data point to the rod via a spring obeying Hooke’s law (force 
is proportional to the length of the spring and therefore stored energy is proportional 
to the square of the spring’s length). In the E step, we keep the rod fixed and allow 
the attachment points to slide up and down the rod so as to minimize the energy. 
This causes each attachment point (independently) to position itself at the orthogonal 
projection of the corresponding data point onto the rod. In the M step, we keep the 
attachment points fixed and then release the rod and allow it to move to the minimum 
energy position. The E step and M step are then repeated until a suitable convergence 
criterion is satisfied, as is illustrated in Figure 16.10. 


16.3.3 EM for factor analysis 


We can determine the parameters $\mu$, $W$, and $\Psi$ in a factor analysis model by
maximum likelihood. The solution for $\mu$ is again given by the sample mean. However,
unlike probabilistic PCA, there is no longer a closed-form maximum likelihood
solution for $W$, which must therefore be found iteratively. Because factor analysis is
a latent-variable model, this can be done using an EM algorithm (Rubin and Thayer,
1982) that is analogous to the one used for probabilistic PCA. Specifically, the E-step
equations are given by

$$\mathbb{E}[z_n] = G W^{\mathrm{T}} \Psi^{-1} (x_n - \bar{x}) \tag{16.72}$$

$$\mathbb{E}[z_n z_n^{\mathrm{T}}] = G + \mathbb{E}[z_n] \mathbb{E}[z_n]^{\mathrm{T}} \tag{16.73}$$

where we have defined

$$G = (I + W^{\mathrm{T}} \Psi^{-1} W)^{-1}. \tag{16.74}$$
Figure 16.10 Synthetic data illustrating the EM algorithm for PCA defined by (16.70) and (16.71). (a) A set
of data points shown in green, together with the true principal components (shown as eigenvectors scaled by
the square roots of the eigenvalues). (b) Initial configuration of the principal subspace defined by $W$, shown in
red, together with the projections of the latent points $Z$ into the data space, given by $Z W^{\mathrm{T}}$, shown in cyan. (c)
After one M step, $W$ has been updated with $Z$ held fixed. (d) After the successive E step, the values of $Z$ have
been updated, giving orthogonal projections, with $W$ held fixed. (e) After the second M step. (f) The converged
solution.


Exercise 16.25 


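Note that this is expressed in a form that involves inversion of matrices of size $M \times M$
rather than $D \times D$ (except for the $D \times D$ diagonal matrix $\Psi$ whose inverse is trivial
to compute in $O(D)$ steps), which is convenient because often $M \ll D$. Similarly,
the M-step equations take the form

$$W_{\mathrm{new}} = \left[ \sum_{n=1}^{N} (x_n - \bar{x}) \mathbb{E}[z_n]^{\mathrm{T}} \right] \left[ \sum_{n=1}^{N} \mathbb{E}[z_n z_n^{\mathrm{T}}] \right]^{-1} \tag{16.75}$$

$$\Psi_{\mathrm{new}} = \mathrm{diag}\left\{ S - W_{\mathrm{new}} \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[z_n] (x_n - \bar{x})^{\mathrm{T}} \right\} \tag{16.76}$$

where the diag operator sets all the non-diagonal elements of a matrix to zero.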
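
The E-step and M-step equations for factor analysis can be sketched in NumPy as follows. This is an illustrative sketch, not a tuned implementation: it takes $S$ in (16.76) to be the sample covariance matrix, stores only the diagonal of $\Psi$ in the vector `psi`, and uses a random initial $W$ and a fixed iteration count (all assumptions). It also records the log likelihood of the current parameters at each iteration as a diagnostic, since EM should never decrease it.

```python
import numpy as np

def factor_analysis_em(X, M, n_iter=100, seed=0):
    """EM for factor analysis; returns (W, psi, log_likelihoods)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)              # mu is fixed at the sample mean
    S = Xc.T @ Xc / N                    # sample covariance matrix
    W = rng.normal(size=(D, M))          # arbitrary initialization (assumption)
    psi = np.diag(S).copy()              # diagonal elements of Psi
    history = []
    for _ in range(n_iter):
        # Log likelihood of the current parameters, tracked as a diagnostic.
        C = W @ W.T + np.diag(psi)
        _, logdet = np.linalg.slogdet(C)
        history.append(-0.5 * N * (D * np.log(2.0 * np.pi) + logdet
                                   + np.trace(np.linalg.solve(C, S))))
        # E step: (16.72) to (16.74). Only an M x M matrix is inverted.
        Wt_Pinv = W.T / psi              # W^T Psi^{-1}, shape (M, D)
        G = np.linalg.inv(np.eye(M) + Wt_Pinv @ W)
        Ez = Xc @ Wt_Pinv.T @ G          # row n is E[z_n]^T (G is symmetric)
        sum_Ezz = N * G + Ez.T @ Ez      # sum over n of E[z_n z_n^T]
        # M step: (16.75) and (16.76).
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        psi = np.diag(S - W @ (Ez.T @ Xc) / N).copy()
    return W, psi, np.array(history)
```

Dividing `W.T` by the vector `psi` applies $\Psi^{-1}$ in $O(DM)$ operations, which reflects the remark above that the inverse of the diagonal matrix is trivial to compute.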
16.4. Nonlinear Latent Variable Models


So far in this chapter we have focused on latent variable models based on linear trans- 
formations from the latent space to the data space. It is natural to ask whether we 
can use the flexibility of deep neural networks to represent more complex transfor- 
mations, while exploiting the learning ability of deep networks to allow the resulting 
distribution to be fitted to a data set. Consider a simple distribution over a vector 
variable z, for example a Gaussian of the form 


$$p_z(z) = \mathcal{N}(z|0, I). \tag{16.77}$$


Now suppose we transform $z$ using a function $x = g(z, w)$ given by a deep neural
network, where $w$ represents the weights and biases. The combination of the
distribution over $z$ together with the neural network defines a distribution over $x$.
Sampling from such a model is straightforward because we can generate samples
from $p_z(z)$ and then transform each of them using the neural network function to
give corresponding samples of $x$. This is an efficient process since it does not
involve iteration.

To learn $g(z, w)$ from data, consider how to evaluate the likelihood function
$p(x|w)$. The distribution of $x$ is given by the change of variables formula for
densities:

$$p_x(x) = p_z(z(x)) \, |\det J(x)| \tag{16.78}$$

where $J$ is the Jacobian matrix of partial derivatives whose elements are given by

$$J_{ij}(x) = \frac{\partial z_i}{\partial x_j}. \tag{16.79}$$

To evaluate the distribution $p_z(z(x))$ on the right-hand side of (16.78) for a given
data vector $x$ and to evaluate the Jacobian matrix in (16.79) for that same value of
$x$, we need the inverse $z = g^{-1}(x, w)$ of the neural network function. For most
neural networks this inverse will not be well defined. For example, the network may 
represent a many-to-one function in which multiple different input values map to 
the same output value, in which case the change of variable formula does not give a 
well-defined density. Moreover, if the dimensionality of the latent space is different 
from that of the data space then the transformation will not be invertible. 

One approach is to restrict our attention to functions g(z, w) that are invertible, 
which requires that z and x have the same dimensionality. We will explore this 
approach in more detail when we introduce the technique of normalizing flows. 
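
As a small illustration of (16.78) in the invertible case, consider a one-dimensional example in which the mapping is $x = \sinh(z)$. This function is a hypothetical stand-in for an invertible network, chosen only because its inverse $z = \operatorname{arcsinh}(x)$ and derivative $\mathrm{d}z/\mathrm{d}x = 1/\sqrt{1 + x^2}$ are available in closed form. The sketch checks numerically that the resulting density integrates to one and assigns the same probability to $x \leq \sinh(1)$ as the latent Gaussian assigns to $z \leq 1$.

```python
import math
import numpy as np

def p_z(z):
    """Standard normal density, as in (16.77) with a one-dimensional z."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def p_x(x):
    """Density of x = sinh(z) obtained from the change of variables (16.78)."""
    z = np.arcsinh(x)                       # inverse mapping z(x)
    jac = 1.0 / np.sqrt(1.0 + x ** 2)       # dz/dx, the 1 x 1 Jacobian (16.79)
    return p_z(z) * np.abs(jac)

# Riemann-sum checks on a grid in x that covers |z| <= 6.
dx = 1e-3
x = np.arange(-202.0, 202.0, dx)
total_mass = np.sum(p_x(x)) * dx
mass_below = np.sum(p_x(x[x <= np.sinh(1.0)])) * dx
cdf_at_one = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))   # P(z <= 1)
```

Without the Jacobian factor the transformed function would not be a normalized density, which is why the determinant in (16.78) must be tractable for this approach to be practical.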


16.4.1 Nonlinear manifolds 


Requiring that the latent and data spaces have the same number of dimensions
is a significant limitation. Consider the situation in which $z$ has dimensionality $M$
and $x$ has dimensionality $D$, where $M < D$. In this case the distribution over $x$
is confined to a manifold, or subspace, of dimensionality $M$, as illustrated in
Figure 16.11. Low-dimensional manifolds arise in many machine learning applications,
for example when modelling the distribution of natural images. Nonlinear
latent-variable models can be very useful in modelling such data because they express the
strong inductive bias that the data does not ‘fill’ the data space but is confined to a
manifold, although the shape and dimensionality of this manifold are typically not
known in advance.

Section 6.1.4

Section 14.1.2

Figure 16.11 Illustration of a mapping from a two-dimensional latent space $z = (z_1, z_2)$
to a three-dimensional data space $x = (x_1, x_2, x_3)$ using a nonlinear function
$x = g(z, w)$ represented by a neural network with parameter vector $w$.

However, one problem with this framework is that it assigns zero probability
density to any data vector that does not lie exactly on the manifold, which is a
problem for gradient-based learning since the likelihood function will be zero at each of
the data points and constant for small changes in $w$, for any realistic data set. To
address this, we follow the approach used previously with regression and classification
problems and define a conditional distribution across the entire data space, whose
parameters are given by the output of the neural network. If, for example, $x$ comprises
a vector of continuous variables then we can choose the conditional distribution to
be a Gaussian:

$$p(x|z, w) = \mathcal{N}(x \,|\, g(z, w), \sigma^2 I) \tag{16.80}$$

in which the neural network $g(z, w)$ has linear output-unit activation functions, and
$g \in \mathbb{R}^D$. The generative model is specified by the latent distribution over $z$
together with the conditional distribution over $x$, and can be represented by the simple
graphical model shown in Figure 16.12.

Note that it is straightforward, and computationally efficient, to draw independent
samples from this distribution. We first draw a sample from the Gaussian distribution
(16.77) using standard methods. Next, we use this value as input to the neural
network, giving an output value $g(z, w)$. Finally, we draw a sample from a Gaussian
distribution with mean $g(z, w)$ and covariance $\sigma^2 I$, as defined by (16.80). This
three-step process can then be repeated to generate multiple independent samples.
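
The three-step sampling procedure can be sketched as follows. In place of a trained neural network, the sketch uses the fixed function $g_1(z) = \sin(z)$, $g_2(z) = \cos(z)$ with $\sigma = 0.3$, as in the example of Figure 16.13; the function names are hypothetical.

```python
import numpy as np

def g(z):
    """Stand-in for the network: maps a 1-D latent z to a point on the unit circle."""
    return np.stack([np.sin(z), np.cos(z)], axis=-1)

def sample_nonlinear_lvm(n, sigma=0.3, seed=0):
    """Draw n independent samples (z, x) from the model defined by (16.77) and (16.80)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                            # step 1: z ~ N(0, 1)
    mean = g(z)                                           # step 2: evaluate g(z)
    x = mean + sigma * rng.standard_normal(mean.shape)    # step 3: x ~ N(g(z), sigma^2 I)
    return z, x

z, x = sample_nonlinear_lvm(20000)
```

All samples are generated in a single vectorized pass, with no iteration over a Markov chain or optimization loop, which is the sense in which sampling from this model is efficient.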

Figure 16.12 Graphical model representing the distribution given by (16.77) and
(16.80), which together define a joint distribution $p(x, z) = p(x|z)p(z)$.

Figure 16.13 Illustration of a nonlinear latent-variable model for a one-dimensional latent space and a
two-dimensional data space. (a) The prior distribution in latent space is given by a zero-mean unit-variance Gaussian
distribution. (b) The three left-most plots show examples of the Gaussian conditional distribution $p(x|z)$ for
different values of $z$, whereas the right-most plot shows the marginal distribution $p(x)$. The nonlinear function
$g(z)$, which defines the mean of the conditional distribution, is given by $g_1(z) = \sin(z)$, $g_2(z) = \cos(z)$, and,
therefore, traces out a circle in data space. The standard deviation of the conditional distribution is given by
$\sigma = 0.3$. [Based on Prince (2020) with permission.]

The combination of a latent-variable distribution $p(z)$ and a conditional
distribution $p(x|z)$ defines a marginal distribution over the data space given by

$$p(x) = \int p(z) \, p(x|z) \, \mathrm{d}z. \tag{16.81}$$


We illustrate this using a simple example involving a one-dimensional latent space 
and a two-dimensional data space in Figure 16.13. 


16.4.2 Likelihood function 


We have seen that it is easy to draw samples from this nonlinear latent-variable 
model. Now suppose we wish to fit the model to an observed data set by maximizing 
the likelihood function. The likelihood is obtained from the product and sum rules 
of probability by integrating over z: 


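$$\begin{aligned}
p(x|w) &= \int p(x|z, w) \, p(z) \, \mathrm{d}z \\
&= \int \mathcal{N}(x \,|\, g(z, w), \sigma^2 I) \, \mathcal{N}(z|0, I) \, \mathrm{d}z.
\end{aligned} \tag{16.82}$$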


Although both distributions inside the integral are Gaussian, the integral is analyti- 
cally intractable due to the highly nonlinear function g(z, w) defined by the neural 
network. 


Figure 16.14 Three example images of handwritten digits, illustrating why
sampling from the latent space to evaluate the likelihood function requires large
numbers of samples. (a) shows the original image, (b) shows a corrupted
image with part of the stroke removed, and (c) shows the original image shifted
by half a pixel down and half a pixel to the right. Image (b) is closer to (a) in
terms of likelihood, even though image (c) is much closer to (a) in appearance.
[From Doersch (2016) with permission.]


One approach for evaluating the likelihood function would be to draw samples 
from the latent space distribution and use these to approximate (16.82) by 


$$p(x|w) \simeq \frac{1}{K} \sum_{i=1}^{K} p(x|z_i, w) \tag{16.83}$$


where $z_i \sim p(z)$. This expresses the distribution over $x$ as a mixture of Gaussians
with fixed mixing coefficients given by $1/K$, and in the limit of an infinite number of
samples, this gives the true likelihood function. However, the value of $K$ needed for
effective training will typically be far too high to be practical. To see why, consider 
the three images of handwritten digits shown in Figure 16.14, and suppose that image 
(a) represents the vector x for which we wish to evaluate the likelihood function. If 
a trained model generated image (b), we would consider this a poor model as this 
image is not a good representation of a digit ‘2’, and so this should be assigned 
a much lower likelihood. Conversely, image (c), which was obtained by shifting 
the digit in (a) down and to the right by half a pixel, is a good example of a digit 
‘2’ and should therefore have a high likelihood. Since the distribution (16.80) is 
Gaussian, the likelihood function is proportional to the exponential of the negative 
squared distance between the output of the network and the data vector x. However, 
the squared distance between (a) and (b) is 0.0387 whereas the squared distance 
between (a) and (c) is 0.2693. So if the variance parameter $\sigma^2$ is set to a sufficiently
small value that image (b) has low likelihood, then image (c) will have an even lower 
likelihood. Even if the model is good at generating digits, we would have to consider 
extremely large numbers of samples for z before seeing a digit that is sufficiently 
close to (a). We therefore seek more sophisticated techniques for training nonlinear 
latent variable models that can be used in practical applications. Before outlining 
such methods, we first discuss briefly some considerations regarding discrete data 
spaces. 
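
The estimator (16.83) is simple to write down, which makes its practical limitation easier to appreciate. The sketch below is an illustration only: it replaces the neural network by a hypothetical linear map $g(z) = Wz$, for which the integral (16.82) happens to be tractable and equal to $\mathcal{N}(x|0, W W^{\mathrm{T}} + \sigma^2 I)$, so that the Monte Carlo estimate can be compared with the exact value. The average is computed in the log domain for numerical stability.

```python
import numpy as np

def log_gauss_iso(x, mean, sigma2):
    """ln N(x | mean, sigma2 * I), evaluated for each row of mean."""
    D = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (D * np.log(2.0 * np.pi * sigma2) + sq_dist / sigma2)

def mc_log_likelihood(x, g, latent_dim, sigma2, K, seed=0):
    """Estimate ln p(x | w) from K prior samples, following (16.83)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((K, latent_dim))       # z_i ~ p(z)
    log_terms = log_gauss_iso(x, g(z), sigma2)     # ln p(x | z_i, w) for each i
    m = log_terms.max()                            # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_terms - m)))

# Hypothetical linear "network" g(z) = W z, for which the exact marginal is known.
W = np.array([[1.0], [0.5]])
sigma2 = 0.5
x = np.array([0.3, -0.2])
estimate = mc_log_likelihood(x, lambda z: z @ W.T, 1, sigma2, K=200_000)

C = W @ W.T + sigma2 * np.eye(2)                   # exact marginal covariance
exact = -0.5 * (2 * np.log(2.0 * np.pi) + np.linalg.slogdet(C)[1]
                + x @ np.linalg.solve(C, x))
```

In this low-dimensional example with a fairly large noise variance the estimate is accurate, but as `sigma2` shrinks or the data dimensionality grows, fewer and fewer of the prior samples land close enough to `x` to contribute, which is the difficulty described above.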
Figure 16.15 Schematic illustration of dequantization, showing (a)
a discrete distribution over a single variable and (b)
an associated dequantized continuous distribution.


16.4.3 Discrete data 


If the observed data set comprises independent binary variables then we can use
a conditional distribution of the form

$$p(x|z, w) = \prod_{i=1}^{D} g_i(z, w)^{x_i} \left(1 - g_i(z, w)\right)^{1 - x_i} \tag{16.84}$$

where $g_i(z, w) = \sigma(a_i(z, w))$ represents the activation of output unit $i$, the
activation function $\sigma(\cdot)$ is given by the logistic sigmoid, and $a_i(z, w)$ is the pre-activation
for output unit $i$. Similarly, for one-hot encoded categorical variables, we can use a
multinomial distribution:


$$p(x|z, w) = \prod_{i=1}^{D} g_i(z, w)^{x_i} \tag{16.85}$$

where

$$g_i(z, w) = \frac{\exp(a_i(z, w))}{\sum_{j} \exp(a_j(z, w))} \tag{16.86}$$

is the softmax activation function. We can also consider combinations of discrete
and continuous variables by forming the product of the associated conditional
distributions.

In practice, continuous variables are represented with discrete values. For example,
in images, the red, green, and blue channel intensities might be expressed using
8-bit numbers representing the values $\{0, \ldots, 255\}$. This can cause problems when
we employ highly flexible models based on deep neural networks, as the likelihood
function can go to zero if the density collapses onto one or more of the discrete
values. The problem can be resolved using a technique called dequantization, which
involves adding noise to the variables, typically drawn from a uniform distribution
over the region between successive discrete values, as shown in Figure 16.15. A
training set is dequantized by replacing each observed value with a sample drawn
randomly from the continuous distribution associated with that discrete
value, and this makes it less likely that the model will discover a pathological
solution.
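
Uniform dequantization of integer-valued data takes only a couple of lines. The following is a minimal sketch, assuming integer intensities with unit spacing between successive levels, so that the noise is drawn from the interval $[0, 1)$; the function name `dequantize` is hypothetical.

```python
import numpy as np

def dequantize(x_int, seed=0):
    """Replace each integer value k by a sample drawn uniformly from [k, k + 1)."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0.0, 1.0, size=np.shape(x_int))   # one draw per variable
    return np.asarray(x_int) + noise

x_int = np.array([0, 17, 128, 255])     # e.g. 8-bit channel intensities
x_cont = dequantize(x_int)
```

The original discrete values can be recovered by rounding down, so no information is lost. During training, fresh noise would typically be drawn each time a data point is used rather than fixed once, although that is a design choice outside what is stated here.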
16.4.4 Four approaches to generative modelling


We have seen that nonlinear latent-variable models based on deep neural net- 
works offer a highly flexible framework for building generative models. Due to the 
universality of the neural network transformation, such models are capable, in prin- 
ciple, of approximating essentially any desired distribution to high accuracy. More- 
over, such models offer the potential, once trained, to generate samples from the 
distribution using an efficient, non-iterative process. However, we have also identified
some challenges associated with training such models that force us to develop
more sophisticated techniques than those needed for linear models. Many such methods
have been proposed, each with its own strengths and limitations. These can
be broadly grouped into four approaches, as follows. 

With generative adversarial networks, or GANs, we relax the requirement for 
the network mapping to be invertible, thereby allowing the latent space to have a 
lower dimensionality than the data space. We also abandon the concept of a likeli- 
hood function and instead introduce a second neural network whose function is to 
provide a training signal for the generative network. Due to the absence of a well- 
defined likelihood function, the training procedure may be brittle, but once trained it 
is straightforward to generate samples from the model, and the results can be of high 
quality. 

The framework of variational autoencoders, or VAEs, also uses a second neural 
network whose role is to approximate the posterior distribution over the latent vari- 
ables, thereby allowing an approximation to the likelihood function to be evaluated. 
Training is more robust than with GANs, and sampling from the trained model is 
straightforward, although it can be harder to obtain the highest quality results. 

In normalizing flows, we set the dimensionality of the latent space to be equal 
to that of the data space and then modify the generative neural network so that it 
becomes invertible. The requirement that the network is invertible restricts its func- 
tional form but it allows the likelihood function to be evaluated without approxima- 
tion and it also allows for efficient sampling. 

Finally, diffusion models use a network that learns to transform a sample from 
the prior distribution into a sample from the data distribution through a sequence of 
denoising steps. This leads to state-of-the-art performance in many applications, al- 
though the cost of sampling can be high due to the multiple denoising passes through 
the network. 

We explore these approaches in detail in the final four chapters of this book. 


Exercises


16.1 (⋆⋆) In this exercise, we use proof by induction to show that the linear projection
onto an $M$-dimensional subspace that maximizes the variance of the projected data
is defined by the $M$ eigenvectors of the data covariance matrix $S$, given by (16.3),
corresponding to the $M$ largest eigenvalues. In Section 16.1, this result was proven
for $M = 1$. Now suppose the result holds for some general value of $M$ and show that
it consequently holds for dimensionality $M + 1$. To do this, first set the derivative
of the variance of the projected data with respect to a vector $u_{M+1}$ defining the new
direction in data space equal to zero. This should be done subject to the constraints
that $u_{M+1}$ is orthogonal to the existing vectors $u_1, \ldots, u_M$, and also that it is
normalized to unit length. Use Lagrange multipliers (Appendix C) to enforce these constraints.
Then make use of the orthonormality properties of the vectors $u_1, \ldots, u_M$ to show
that the new vector $u_{M+1}$ is an eigenvector of $S$. Finally, show that the variance is
maximized if the eigenvector is chosen to be the one corresponding to eigenvalue
$\lambda_{M+1}$ where the eigenvalues have been ordered in decreasing value.


16.2 (⋆⋆) Show that the minimum value of the PCA error measure $J$ given by (16.15)
with respect to the $u_i$, subject to the orthonormality constraints (16.7), is obtained
when the $u_i$ are eigenvectors of the data covariance matrix $S$. To do this, introduce a
matrix $H$ of Lagrange multipliers, one for each constraint, so that the modified error
measure, in matrix notation, reads

$$\widetilde{J} = \mathrm{Tr}\left\{ \widehat{U}^{\mathrm{T}} S \widehat{U} \right\} + \mathrm{Tr}\left\{ H (I - \widehat{U}^{\mathrm{T}} \widehat{U}) \right\} \tag{16.87}$$

where $\widehat{U}$ is a matrix of dimension $D \times (D - M)$ whose columns are given by $u_i$.
Now minimize $\widetilde{J}$ with respect to $\widehat{U}$ and show that the solution satisfies $S\widehat{U} = \widehat{U}H$.
Clearly, one possible solution is that the columns of $\widehat{U}$ are eigenvectors of $S$, in
which case $H$ is a diagonal matrix containing the corresponding eigenvalues. To
obtain the general solution, show that $H$ can be assumed to be a symmetric matrix,
and by using its eigenvector expansion, show that the general solution to $S\widehat{U} = \widehat{U}H$
gives the same value for $\widetilde{J}$ as the specific solution in which the columns of $\widehat{U}$ are
the eigenvectors of $S$. Because these solutions are all equivalent, it is convenient to
choose the eigenvector solution.


16.3 (⋆) Verify that the eigenvectors defined by (16.30) are normalized to unit length,
assuming that the eigenvectors $v_i$ have unit length.


16.4 (⋆) Suppose we replace the zero-mean, unit-covariance latent space distribution (16.31)
in the probabilistic PCA model by a general Gaussian distribution of the form $\mathcal{N}(z|m, \Sigma)$.
By redefining the parameters of the model, show that this leads to an identical model
for the marginal distribution $p(x)$ over the observed variables for any valid choice of
$m$ and $\Sigma$.


16.5 (⋆⋆) Let $x$ be a $D$-dimensional random variable having a Gaussian distribution given
by $\mathcal{N}(x|\mu, \Sigma)$, and consider the $M$-dimensional random variable given by $y = Ax + b$
where $A$ is an $M \times D$ matrix. Show that $y$ also has a Gaussian distribution,
and find expressions for its mean and covariance. Discuss the form of this Gaussian
distribution for $M < D$, for $M = D$, and for $M > D$.


16.6 (⋆⋆) By making use of the results (2.122) and (2.123) for the mean and covariance
of a general distribution, derive the result (16.35) for the marginal distribution $p(x)$
in the probabilistic PCA model.


16.7 (⋆) Draw a directed probabilistic graph for the probabilistic PCA model described in
Section 16.2 in which the components of the observed variable $x$ are shown explicitly
as separate nodes. Hence, verify that the probabilistic PCA model has the same
independence structure as the naive Bayes model discussed in Section 11.2.3.


16.8 (⋆⋆) By making use of the result (3.100), show that the posterior distribution $p(z|x)$
for the probabilistic PCA model is given by (16.43).


16.9 (⋆) Verify that maximizing the log likelihood (16.44) for the probabilistic PCA model
with respect to the parameter $\mu$ gives the result $\mu_{\mathrm{ML}} = \bar{x}$ where $\bar{x}$ is the mean of the
data vectors.


16.10 (⋆⋆) By evaluating the second derivatives of the log likelihood function (16.44) for
the probabilistic PCA model with respect to the parameter $\mu$, show that the stationary
point $\mu_{\mathrm{ML}} = \bar{x}$ represents the unique maximum.


16.11 (⋆⋆) Show that in the limit $\sigma^2 \to 0$, the posterior mean for the probabilistic PCA
model becomes an orthogonal projection onto the principal subspace, as in
conventional PCA.


16.12 (⋆⋆) For $\sigma^2 > 0$ show that the posterior mean in the probabilistic PCA model is
shifted towards the origin relative to the orthogonal projection.


16.13 (⋆⋆) Show that the optimal reconstruction of a data point under probabilistic PCA,
according to the least-squares projection cost of conventional PCA, is given by

$$\widetilde{x} = W_{\mathrm{ML}} (W_{\mathrm{ML}}^{\mathrm{T}} W_{\mathrm{ML}})^{-1} M \, \mathbb{E}[z|x]. \tag{16.88}$$


16.14 (⋆) The number of independent parameters in the covariance matrix for a
probabilistic PCA model with an $M$-dimensional latent space and a $D$-dimensional data space
is given by (16.52). Verify that for $M = D - 1$, the number of independent
parameters is the same as in a general covariance Gaussian, whereas for $M = 0$ it is the
same as for a Gaussian with an isotropic covariance.


16.15 (⋆) Derive an expression for the number of independent parameters in the factor
analysis model described in Section 16.2.4.


16.16 (⋆⋆) Show that the factor analysis model described in Section 16.2.4 is invariant
under rotations of the latent space coordinates.


16.17 (⋆⋆) Consider a linear-Gaussian latent-variable model having a latent space
distribution $p(z) = \mathcal{N}(z|0, I)$ and a conditional distribution for the observed variable
$p(x|z) = \mathcal{N}(x|Wz + \mu, \Phi)$ where $\Phi$ is an arbitrary symmetric positive-definite
noise covariance matrix. Now suppose that we make a non-singular linear
transformation of the data variables $x \to Ax$, where $A$ is a $D \times D$ matrix. If $\mu_{\mathrm{ML}}$, $W_{\mathrm{ML}}$,
and $\Phi_{\mathrm{ML}}$ represent the maximum likelihood solution corresponding to the original
un-transformed data, show that $A\mu_{\mathrm{ML}}$, $AW_{\mathrm{ML}}$, and $A\Phi_{\mathrm{ML}}A^{\mathrm{T}}$ represent the
corresponding maximum likelihood solution for the transformed data set. Finally, show
that the form of the model is preserved in two cases: (i) $A$ is a diagonal matrix
and $\Phi$ is a diagonal matrix. This corresponds to factor analysis. The transformed
$\Phi$ remains diagonal, and hence factor analysis is covariant under component-wise
re-scaling of the data variables; (ii) $A$ is orthogonal and $\Phi$ is proportional to the unit
matrix so that $\Phi = \sigma^2 I$. This corresponds to probabilistic PCA. The transformed $\Phi$
matrix remains proportional to the unit matrix, and hence probabilistic PCA is
covariant under a rotation of the axes of the data space, as is the case for conventional
PCA.


16.18 (⋆) Verify that the log likelihood function for a model with continuous latent
variables can be written as the sum of two terms in the form (16.57) in which the terms
are defined by (16.58) and (16.59). This can be done by using the product rule of
probability in the form

$$p(x, z|w) = p(z|x, w) \, p(x|w) \tag{16.89}$$

and then substituting for $p(x, z|w)$ in (16.58).


16.19 (⋆) Show that, for a set of i.i.d. data, the evidence lower bound (ELBO) takes the
form (16.63).


16.20 (⋆⋆) Draw a directed probabilistic graphical model representing a discrete mixture
of probabilistic PCA models in which each PCA model has its own values of W, 
p, and o*. Now draw a modified graph in which these parameter values are shared 
between the components of the mixture. 


(xx) Derive the M-step equations (16.68) and (16.69) for the probabilistic PCA 
model by maximizing the expected complete-data log likelihood function given by 


(16.65). 


(x xx) One benefit of a probabilistic formulation of principal component analysis is 
that it can be applied to a data set in which some of the values are missing, provided 
they are missing at random. Derive the EM algorithm for maximizing the likelihood 
function for the probabilistic PCA model in this situation. Note that the {Z }, as well 
as the missing data values that are components of the vectors {Xn }, are now latent 
variables. Show that in the special case in which all the data values are observed, 
this reduces to the EM algorithm for probabilistic PCA derived in Section 16.3.2. 


(xx) Let W be a D x M matrix whose columns define a linear subspace of di- 
mensionality M embedded within a data space of dimensionality D, and let u be a 
D-dimensional vector. Given a data set {x,,} where n = 1,...,.N, we can approx- 
imate the data points using a linear mapping from a set of M/-dimensional vectors 
{Zn}, so that x, is approximated by Wz,, + u. The associated sum-of-squares 
reconstruction cost is given by 


N 
J =X ||xn — u- Wa”. (16.90) 


n=1 


First show that minimizing J with respect to yz leads to an analogous expression with 
Xn and Zp replaced by zero-mean variables x,, — X and Zn — Z, respectively, where x 
and Z denote sample means. Then show that minimizing J with respect to Zn, where 


16.24 


16.25 


16.26 
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W is kept fixed, gives rise to the PCA E step (16.70), and that minimizing J with 
respect to W, where {Zn } is kept fixed, gives rise to the PCA M step (16.71). 


(x x) Derive the formulae (16.72) and (16.73) for the E step of the EM algorithm for 
factor analysis. Note that from the result of Exercise 16.26, the parameter u can be 
replaced by the sample mean x. 


(x x) Write down an expression for the expected complete-data log likelihood func- 
tion for the factor analysis model, and hence derive the corresponding M-step equa- 
tions (16.75) and (16.76). 


(x x) By considering second derivatives, show that the only stationary point of the 
log likelihood function for the factor analysis model discussed in Section 16.2.4 
with respect to the parameter yz is given by the sample mean defined by (16.1). 
Furthermore, show that this stationary point is a maximum. 




17 Generative Adversarial Networks


Generative models use machine learning algorithms to learn a distribution from a set 
of training data and then generate new examples from that distribution. For example, 
a generative model might be trained on images of animals and then used to generate 
new images of animals. We can think of such a generative model in terms of a 
distribution p(x|w) in which x is a vector in the data space, and w represents the
learnable parameters of the model. In many cases we are interested in conditional 
generative models of the form p(x|c, w) where c represents a vector of conditioning 
variables. In the case of our generative model for animal images, we may wish to 
specify that a generated image should be of a particular animal, such as a cat or a 
dog, specified by the value of c. 

For real-world applications such as image generation, the distributions are extremely
complex, and consequently the introduction of deep learning has dramatically
improved the performance of generative models. We have already encountered
an important class of deep generative models when we discussed autoregressive
large language models based on transformers. We have also outlined four important
classes of generative model based on nonlinear latent variable models, and in
this chapter we discuss the first of these, called generative adversarial networks. The
other three approaches will be discussed in subsequent chapters.

Figure 17.1 Schematic illustration of a GAN in which a discriminator neural network
$d(\mathbf{x}, \boldsymbol{\phi})$ is trained to distinguish between real samples from the training
set, in this case images of kittens, and synthetic samples produced by the generator network
g(z, w). The generator aims to maximize the error of the discriminator network by producing
realistic images, whereas the discriminator network tries to minimize the same error by
becoming better at distinguishing real from synthetic examples. [Panel labels: real images,
synthetic images.]


17.1 Adversarial Training


Consider a generative model based on a nonlinear transformation from a latent space
z to a data space x. We introduce a latent distribution p(z), which might take the
form of a simple Gaussian

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0}, \mathbf{I}), \tag{17.1}$$

along with a nonlinear transformation x = g(z, w) defined by a deep neural
network with learnable parameters w known as the generator. Together these
implicitly define a distribution over x, and our goal is to fit this distribution to a data
set of training examples $\{\mathbf{x}_n\}$ where n = 1, ..., N. However, we cannot determine
w by optimizing the likelihood function because this cannot, in general, be evaluated
in closed form. The key idea of generative adversarial networks, or GANs
(Goodfellow et al., 2014; Ruthotto and Haber, 2021), is to introduce a second network,
the discriminator, which is trained jointly with the generator network and which
provides a training signal to update the weights of the generator. This is illustrated
in Figure 17.1.


Section 1.2.4
The goal of the discriminator network is to distinguish between real examples
from the data set and synthetic, or ‘fake’, examples produced by the generator net- 
work, and it is trained by minimizing a conventional classification error function. 
Conversely, the goal of the generator network is to maximize this error by synthe- 
sizing examples from the same distribution as the training set. The generator and 
discriminator networks are therefore working against each other, hence the term ‘ad- 
versarial’. This is an example of a zero-sum game in which any gain by one network 
represents a loss to the other. It allows the discriminator network to provide a training 
signal, which can be used to train the generator network, and this turns the unsuper- 
vised density modelling problem into a form of supervised learning. 


17.1.1 Loss function 


To make this precise, we define a binary target variable given by 


$$t = 1, \quad \text{real data}, \tag{17.2}$$
$$t = 0, \quad \text{synthetic data}. \tag{17.3}$$


The discriminator network has a single output unit with a logistic-sigmoid activation 
function, whose output represents the probability that a data vector x is real: 


$$P(t = 1) = d(\mathbf{x}, \boldsymbol{\phi}). \tag{17.4}$$


We train the discriminator network using the standard cross-entropy error function,
which takes the form

$$E(\mathbf{w}, \boldsymbol{\phi}) = -\frac{1}{N} \sum_{n=1}^{N} \left\{ t_n \ln d_n + (1 - t_n) \ln(1 - d_n) \right\} \tag{17.5}$$

where $d_n = d(\mathbf{x}_n, \boldsymbol{\phi})$ is the output of the discriminator network for input
vector n, and we have normalized by the total number of data points. The training set
comprises both real data examples denoted $\mathbf{x}_n$ and synthetic examples given by the
output of the generator network $g(\mathbf{z}_n, \mathbf{w})$ where $\mathbf{z}_n$ is a random sample
from the latent space distribution p(z). Since $t_n = 1$ for real examples and $t_n = 0$ for
synthetic examples, we can write the error function (17.5) in the form

$$E_{\mathrm{GAN}}(\mathbf{w}, \boldsymbol{\phi}) = -\frac{1}{N_{\mathrm{real}}} \sum_{n \in \mathrm{real}} \ln d(\mathbf{x}_n, \boldsymbol{\phi}) - \frac{1}{N_{\mathrm{synth}}} \sum_{n \in \mathrm{synth}} \ln\left(1 - d(g(\mathbf{z}_n, \mathbf{w}), \boldsymbol{\phi})\right) \tag{17.6}$$

where typically the number $N_{\mathrm{real}}$ of real data points is equal to the number
$N_{\mathrm{synth}}$ of synthetic data points. This combination of generator and discriminator
networks can be trained end-to-end using stochastic gradient descent with gradients
evaluated using backpropagation. However, the unusual aspect is the adversarial training
whereby the error is minimized with respect to $\boldsymbol{\phi}$ but maximized with respect to w.
This maximization can be done using standard gradient-based methods with the sign
of the gradient reversed, so that the parameter updates become

$$\Delta\boldsymbol{\phi} = -\lambda \nabla_{\boldsymbol{\phi}} E_n(\mathbf{w}, \boldsymbol{\phi}) \tag{17.7}$$
$$\Delta\mathbf{w} = \lambda \nabla_{\mathbf{w}} E_n(\mathbf{w}, \boldsymbol{\phi}) \tag{17.8}$$

where $E_n(\mathbf{w}, \boldsymbol{\phi})$ denotes the error defined for data point n or more generally for
a mini-batch of data points. Note that the two terms in (17.7) and (17.8) have dif- 
ferent signs since the discriminator is trained to decrease the error rate whereas the 
generator is trained to increase it. In practice, training alternates between updating 
the parameters of the generative network and updating those of the discriminative 
network, in each case taking just one gradient descent step using a mini-batch, af- 
ter which a new set of synthetic samples is generated. If the generator succeeds in 
finding a perfect solution, then the discriminator network will be unable to tell the 
difference between the real and synthetic data and hence will always produce an out- 
put of 0.5. Once the GAN is trained, the discriminator network is discarded and the 
generator network can be used to synthesize new examples in the data space by sam- 
pling from the latent space and propagating those samples through the trained gen- 
erator network. We can show that for generative and discriminative networks having 
unlimited flexibility, a fully optimized GAN will have a generative distribution that 
matches the data distribution exactly. Some impressive examples of synthetic face 
images generated by a GAN are shown in Figure 1.3. 
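The error function (17.6) can be evaluated directly from the discriminator outputs. The following is a minimal sketch in plain Python, not the book's implementation; the function name `gan_error` and the example output values are illustrative assumptions. It shows that a discriminator which outputs 0.5 for every input gives an error of ln 4, which is the value reached when the discriminator cannot separate real from synthetic data.

```python
import math

def gan_error(d_real, d_synth):
    """GAN error (17.6) computed from discriminator outputs.

    d_real:  outputs d(x_n, phi) on real examples, each in (0, 1)
    d_synth: outputs d(g(z_n, w), phi) on synthetic examples, each in (0, 1)
    """
    real_term = -sum(math.log(d) for d in d_real) / len(d_real)
    synth_term = -sum(math.log(1.0 - d) for d in d_synth) / len(d_synth)
    return real_term + synth_term

# Illustrative (made-up) discriminator outputs.
# A discriminator that separates the two sets well gives a small error.
low = gan_error([0.9, 0.95], [0.1, 0.05])
# A discriminator that is fooled by the generator gives a large error.
high = gan_error([0.9, 0.95], [0.9, 0.95])
# A discriminator that outputs 0.5 everywhere gives 2 ln 2 = ln 4.
chance = gan_error([0.5, 0.5], [0.5, 0.5])
```

In terms of the updates (17.7) and (17.8), the discriminator step moves towards values like `low`, whereas the generator step moves towards values like `high`.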

The GAN model discussed so far generates samples from the unconditional dis- 
tribution p(x). For example, it could generate synthetic images of dogs if it is trained 
on dog images. We can also create conditional GANs (Mirza and Osindero, 2014), 
which sample from a conditional distribution p(x|c) in which the conditioning vec- 
tor c might, for example, represent different species of dog. To do this, both the 
generator and the discriminator network take c as an additional input, and labelled 
examples of images, comprising pairs $\{\mathbf{x}_n, \mathbf{c}_n\}$, are used for training. Once the GAN
has been trained, images from a desired class can be generated by setting c to the 
corresponding class vector. Compared to training separate GANs for each class, this 
has the advantage that shared internal representations can be learned jointly across 
all classes, thereby making more efficient use of the data. 


17.1.2 GAN training in practice 


Although GANs can produce high quality results, they are not easy to train suc- 
cessfully due to the adversarial learning. Also, unlike standard error function min- 
imization, there is no metric of progress because the objective can go up as well as 
down during training. 
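A toy calculation can make this difficulty concrete. The sketch below borrows the two-parameter cost function E(a, b) = ab from Exercise 17.2, with a playing the role of the generator (ascending E) and b the role of the discriminator (descending E). The function name, step size, and number of steps are illustrative assumptions. With a finite step size, each simultaneous update multiplies the distance from the stationary point (0, 0) by $\sqrt{1 + \eta^2}$, so the iterates spiral outwards instead of converging.

```python
import math

def adversarial_steps(a, b, eta, n_steps):
    """Simultaneous gradient steps on E(a, b) = a * b.

    a is updated to increase E and b is updated to decrease E.
    Returns the distance from (0, 0) before the first step and after each step.
    """
    radii = [math.hypot(a, b)]
    for _ in range(n_steps):
        grad_a, grad_b = b, a                      # dE/da = b, dE/db = a
        a, b = a + eta * grad_a, b - eta * grad_b  # ascent in a, descent in b
        radii.append(math.hypot(a, b))
    return radii

radii = adversarial_steps(a=1.0, b=0.0, eta=0.1, n_steps=100)
```

The value of E itself oscillates in sign along this trajectory, which is one way to see why the objective is not a reliable measure of progress.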

One challenge that can arise is called mode collapse, in which the generator network
weights adapt during training such that all latent-variable samples z are mapped
to a subset of possible valid outputs. In extreme cases the output can correspond to
just one, or a small number, of the output values x. The discriminator then assigns
the value 0.5 to these instances, and training ceases. For example, a GAN trained on
handwritten digits might learn to generate only examples of the digit ‘3’, and while
the discriminator is unable to distinguish these from genuine examples of the digit
‘3’, it fails to recognize that the generator is not generating the full range of digits.

Figure 17.2 Conceptual illustration of why it can be difficult to train GANs, showing a simple
one-dimensional data space x with the fixed, but unknown, data distribution
$p_{\mathrm{Data}}(x)$ and the initial generative distribution $p_{\mathrm{G}}(x)$. The optimal
discriminator function d(x) has virtually zero gradient in the vicinity of either the training
or synthetic data points, making learning very slow. A smoothed version $\tilde{d}(x)$ of the
discriminator function can lead to faster learning.

Insight into the difficulty of training GANs can be obtained by considering Figure 17.2,
which shows a simple one-dimensional data space x with samples $\{x_n\}$
drawn from the fixed, but unknown, data distribution $p_{\mathrm{Data}}(x)$. Also shown is the
initial generative distribution $p_{\mathrm{G}}(x)$ together with samples drawn from this
distribution. Because the data and generative distributions are so different, the optimal
discriminator function d(x) is easy to learn and has a very steep fall-off with virtually
zero gradient in the vicinity of either the real or synthetic samples. Consider
the second term in the GAN error function (17.6). Because $d(g(\mathbf{z}, \mathbf{w}), \boldsymbol{\phi})$
is equal to zero across the region spanned by the generated samples, small changes in the
parameters w of the generative network produce very little change in the output of
the discriminator and so the gradients are small and learning proceeds slowly.

This can be addressed by using a smoothed version $\tilde{d}(x)$ of the discriminator
function, illustrated in Figure 17.2, thereby providing a stronger gradient to drive 
the training of the generator network. The least-squares GAN (Mao et al., 2016) 
achieves smoothing by modifying the discriminator to produce a real-valued output 
rather than a probability in the range (0, 1) and by replacing the cross-entropy error 
function with a sum-of-squares error function. Alternatively, the technique of in- 
stance noise (Sønderby et al., 2016) adds Gaussian noise to both the real data and 
the synthetic samples, again leading to a smoother discriminator function. 

Numerous other modifications to the GAN error function and training procedure
have been proposed to improve training (Mescheder, Geiger, and Nowozin, 2018).
One change that is often used is to replace the generative network term in the original
error function

$$-\frac{1}{N_{\mathrm{synth}}} \sum_{n \in \mathrm{synth}} \ln\left(1 - d(g(\mathbf{z}_n, \mathbf{w}), \boldsymbol{\phi})\right) \tag{17.9}$$

with the modified form

$$\frac{1}{N_{\mathrm{synth}}} \sum_{n \in \mathrm{synth}} \ln d(g(\mathbf{z}_n, \mathbf{w}), \boldsymbol{\phi}). \tag{17.10}$$

Figure 17.3 Plots of −ln(d) and ln(1 − d) showing the very different behaviour of the
gradients close to d = 0 and d = 1.


Although the first form minimizes the probability that the image is fake, the second
version maximizes the probability that the image is real. The different properties
of these two forms can be understood from Figure 17.3. When the generative
distribution $p_{\mathrm{G}}(\mathbf{x})$ is very different from the true data distribution
$p_{\mathrm{Data}}(\mathbf{x})$, the quantity d(g(z, w)) is close to zero, and hence the first form
has a very small gradient,
whereas the second form has a large gradient, leading to faster training.
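The difference between the two forms can be checked numerically. The sketch below assumes, as in Section 17.1.1, that the discriminator output is a logistic sigmoid d = σ(a) of a pre-activation a, and it differentiates the per-sample terms with respect to a rather than with respect to d (this choice of variable is an assumption made for illustration). The derivative of ln(1 − σ(a)) is −σ(a), which vanishes as d approaches zero, whereas the derivative of ln σ(a) is 1 − σ(a), which approaches one.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of ln(1 - sigmoid(a)), the log term appearing in (17.9)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of ln(sigmoid(a)), the log term appearing in (17.10)."""
    return 1.0 - sigmoid(a)

# Pre-activation for a sample the discriminator confidently labels synthetic,
# so that d = sigmoid(a) is close to zero.
a = -8.0
small = grad_saturating(a)      # magnitude close to 0: little training signal
large = grad_non_saturating(a)  # close to 1: strong training signal
```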

A more direct way to ensure that the generator distribution $p_{\mathrm{G}}(\mathbf{x})$ moves
towards the data distribution $p_{\mathrm{Data}}(\mathbf{x})$ is to modify the error criterion to
reflect how far apart the two distributions are in data space. This can be measured using the
Wasserstein distance, also known as the earth mover’s distance. Imagine the distribution
$p_{\mathrm{G}}(\mathbf{x})$ as a pile of earth that is transported in small increments to construct
the distribution $p_{\mathrm{Data}}(\mathbf{x})$. The Wasserstein metric is the total amount of earth
moved multiplied by the mean distance moved. Of the many ways of rearranging the pile of earth
to build $p_{\mathrm{Data}}(\mathbf{x})$, the one that yields the smallest mean distance is the one
used to define the metric. In practice, this cannot be implemented directly, and it is
approximated by using a discriminator network that has real-valued outputs and then limiting the
gradient $\nabla_{\mathbf{x}} d(\mathbf{x}, \boldsymbol{\phi})$ of the discriminator function with
respect to x by using weight clipping, giving rise to the Wasserstein GAN (Arjovsky, Chintala,
and Bottou, 2017). An improved approach is to introduce a penalty on the gradient, giving rise
to the gradient penalty Wasserstein GAN (Gulrajani et al., 2017) whose error function is
given by

$$E_{\mathrm{WGAN\text{-}GP}}(\mathbf{w}, \boldsymbol{\phi}) = -\frac{1}{N_{\mathrm{real}}} \sum_{n \in \mathrm{real}} \left[ \ln d(\mathbf{x}_n, \boldsymbol{\phi}) - \eta \left( \left\| \nabla_{\mathbf{x}_n} d(\mathbf{x}_n, \boldsymbol{\phi}) \right\|^2 - 1 \right)^2 \right] + \frac{1}{N_{\mathrm{synth}}} \sum_{n \in \mathrm{synth}} \ln d(g(\mathbf{z}_n, \mathbf{w}), \boldsymbol{\phi}) \tag{17.11}$$

where $\eta$ controls the relative importance of the penalty term.
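For intuition about the earth mover's distance itself, it helps to look at a case where it can be computed exactly. In one dimension, with two samples of equal size and equal weight per point, the cheapest transport plan pairs the points in sorted order. The sketch below relies on that special case; the function name and the sample values are illustrative assumptions, and this is not how a Wasserstein GAN estimates the distance during training.

```python
def wasserstein_1d(xs, ys):
    """Earth mover's (Wasserstein-1) distance between two equal-sized 1-D samples.

    Each point carries the same mass, so the optimal plan matches the
    k-th smallest point of xs to the k-th smallest point of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

real = [0.0, 1.0, 2.0]
synthetic = [5.0, 3.0, 4.0]            # the real sample shifted by 3
distance = wasserstein_1d(real, synthetic)
```

Unlike the saturated cross-entropy error, this quantity keeps changing as the synthetic sample is moved towards the real one, even when the two samples do not overlap.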
The basic concept of the GAN has given rise to a huge research literature, with many
algorithmic developments and numerous applications. One of the most widespread 
and successful application areas for GANs is the generation of images. Early GAN 
models used fully connected networks for the generator and discriminator. How- 
ever, there are many benefits to using convolutional networks, especially for images 
of higher resolution. The discriminator network takes an image as input and pro- 
vides a scalar probability as output, so a standard convolutional network is appro- 
priate. The generator network needs to map a lower-dimensional latent space into a 
high-resolution image, and so a network based on transpose convolutions is used, as 
illustrated in Figure 17.4. 
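The way transpose convolutions expand the spatial resolution can be checked with the standard output-size formula for a transpose convolution without output padding or dilation. In the sketch below, the kernel size 4, stride 2, and padding 1 are assumptions chosen so that each layer doubles the width; the channel counts are those shown in Figure 17.4.

```python
def transpose_conv_size(size_in, kernel=4, stride=2, padding=1):
    """Output width of a transpose convolution (no output padding, no dilation)."""
    return (size_in - 1) * stride - 2 * padding + kernel

# Channel counts taken from Figure 17.4; the hyperparameters above are assumed.
channels = [1024, 512, 256, 128, 3]
sizes = [4]
for _ in channels[1:]:
    sizes.append(transpose_conv_size(sizes[-1]))

# (height, width, channels) of each block in the generator
shapes = [(s, s, c) for s, c in zip(sizes, channels)]
```

Here `shapes` runs from a 4 × 4 × 1024 block up to a 64 × 64 × 3 image, with the number of channels falling as the resolution grows.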

High quality images can be obtained by progressively growing both the gener- 
ator network and the discriminator network starting from a low resolution and then 
successively adding new layers that model increasingly fine details as training pro- 
gresses (Karras et al., 2017). This speeds up the training and permits the synthesis of 
high-resolution images of size 1024 x 1024 starting from images of size 4 x 4. As an 
example of the scale and complexity of some GAN architectures, consider the GAN 
model for class-conditional image generation called BigGAN, whose architecture is 
shown in Figure 17.5. 


17.2.1 CycleGAN 


As an example of the broad variety of GANs we consider an architecture called
a CycleGAN (Zhu et al., 2017). This also illustrates how techniques in deep learning
can be adapted to solve different kinds of problems beyond traditional tasks such as
classification and density estimation. Consider the problem of turning a photograph
into a Monet painting of the same scene, or vice versa. In Figure 17.6 we show
examples of image pairs from a trained CycleGAN that has learned to perform such
an image-to-image translation.

Figure 17.4 Example architecture of a deep convolutional GAN showing the use of transpose
convolutions to expand the dimensionality in successive blocks of the network. [Block sizes
shown in the figure: 4 × 4 × 1024, 8 × 8 × 512, 16 × 16 × 256, 32 × 32 × 128, 64 × 64 × 3.]

Figure 17.5 (a) Architecture of the generative network in the BigGAN model, which has over
70 million parameters. (b) Details of each of the residual blocks in the generative network.
The discriminative network, which has 88 million parameters, has a somewhat analogous structure
except that it uses average pooling layers to reduce the dimensionality, instead of using
up-sampling to increase the dimensionality. [Based on Brock, Donahue, and Simonyan (2018).]

The aim is to learn two bijective (one-to-one) mappings, one that goes from the
domain X of photographs to the domain Y of Monet paintings and one in the reverse
direction. To achieve this, CycleGAN makes use of two conditional generators, $g_X$
and $g_Y$, and two discriminators, $d_X$ and $d_Y$. The generator
$g_X(\mathbf{y}, \mathbf{w}_X)$ takes as input a sample painting $\mathbf{y} \in Y$ and generates a
corresponding synthetic photograph, whereas the discriminator
$d_X(\mathbf{x}, \boldsymbol{\phi}_X)$ distinguishes between synthetic and real photographs.
Similarly, the generator $g_Y(\mathbf{x}, \mathbf{w}_Y)$ takes a photograph $\mathbf{x} \in X$ as
input and generates a synthetic painting y, and the discriminator
$d_Y(\mathbf{y}, \boldsymbol{\phi}_Y)$ distinguishes between synthetic paintings and real ones.
The discriminator $d_X$ is therefore trained on a combination of synthetic photographs
generated by $g_X$ and real photographs, whereas $d_Y$ is trained on a combination of
synthetic paintings generated by $g_Y$ and real paintings.

Figure 17.6 Examples of image translation using a CycleGAN showing the synthesis of a
photographic-style image from a Monet painting (top row) and the synthesis of an image
in the style of a Monet painting from a photograph (bottom row). [From Zhu et al. (2017)
with permission.]

If we train this architecture using the standard GAN loss function, it would learn 
to generate realistic synthetic Monet paintings and realistic synthetic photographs, 
but there would be nothing to force a generated painting to look anything like the 
corresponding photograph, or vice versa. We therefore introduce an additional term 
in the loss function called the cycle consistency error, containing two terms, whose 
construction is illustrated in Figure 17.7. 

Figure 17.7 Diagram showing how the cycle consistency error is calculated for an example
photograph $\mathbf{x}_n$. The photograph is first mapped into the painting domain using the
generator $g_Y$, and the resulting vector is then mapped back into the photograph domain using
the generator $g_X$. The discrepancy between the resulting photograph and the original
$\mathbf{x}_n$ defines a contribution to the cycle consistency error. An analogous process is
used to calculate the contribution to the cycle consistency error from a painting
$\mathbf{y}_n$ by mapping it to a photograph using $g_X$ and then back to a painting using $g_Y$.

Figure 17.8 Flow of information through a CycleGAN. The total error for the data points
$\mathbf{x}_n$ and $\mathbf{y}_n$ is the sum of the four component errors.

The goal is to ensure that when a photograph is translated into a painting and
then back into a photograph it should be close to the original photograph, thereby
ensuring that the generated painting retains sufficient information about the
photograph to allow the photograph to be reconstructed. Similarly, when a painting is
translated into a photograph and then back into a painting it should be close to the
original painting. Applying this to all the photographs and paintings in the training
set then gives a cycle consistency error of the form

$$E_{\mathrm{cyc}}(\mathbf{w}_X, \mathbf{w}_Y) = \frac{1}{N_X} \sum_{n \in X} \left\| g_X(g_Y(\mathbf{x}_n)) - \mathbf{x}_n \right\|_1 + \frac{1}{N_Y} \sum_{n \in Y} \left\| g_Y(g_X(\mathbf{y}_n)) - \mathbf{y}_n \right\|_1 \tag{17.12}$$

where $\| \cdot \|_1$ denotes the L1 norm. The cycle consistency error is added to the usual
GAN loss functions defined by (17.6) to give a total error function:

$$E_{\mathrm{GAN}}(\mathbf{w}_X, \boldsymbol{\phi}_X) + E_{\mathrm{GAN}}(\mathbf{w}_Y, \boldsymbol{\phi}_Y) + \eta E_{\mathrm{cyc}}(\mathbf{w}_X, \mathbf{w}_Y) \tag{17.13}$$

where the coefficient $\eta$ determines the relative importance of the GAN errors and the
cycle consistency error. Information flow through the CycleGAN when calculating
the error function for one image and one painting is shown in Figure 17.8.
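The cycle consistency error (17.12) can be sketched in a few lines. In the example below, the two "generators" are hypothetical stand-ins (simple scalings of a two-component vector) rather than neural networks, and all names and data values are illustrative assumptions. The point is only to show that the error is zero when $g_X$ exactly inverts $g_Y$ and positive otherwise.

```python
def l1_distance(u, v):
    """L1 norm of the difference between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cycle_consistency_error(photos, paintings, g_X, g_Y):
    """Cycle consistency error (17.12) for generators g_X and g_Y."""
    photo_term = sum(l1_distance(g_X(g_Y(x)), x) for x in photos) / len(photos)
    painting_term = sum(l1_distance(g_Y(g_X(y)), y) for y in paintings) / len(paintings)
    return photo_term + painting_term

# Hypothetical toy generators acting on 2-component "images".
def g_Y(x):                     # photograph -> painting
    return [2.0 * v for v in x]

def g_X(y):                     # painting -> photograph, exact inverse of g_Y
    return [0.5 * v for v in y]

def g_X_bad(y):                 # painting -> photograph, not an inverse of g_Y
    return [0.5 * v + 1.0 for v in y]

photos = [[0.0, 1.0], [2.0, 3.0]]
paintings = [[1.0, 1.0]]

consistent = cycle_consistency_error(photos, paintings, g_X, g_Y)
inconsistent = cycle_consistency_error(photos, paintings, g_X_bad, g_Y)
```

In the full objective (17.13) this quantity would be scaled by η and added to the two GAN errors.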

We have seen that GANs can perform well as generative models, but they can 
also be used for representation learning in which rich statistical structure in a data 
set is revealed through unsupervised learning. When the deep convolutional GAN 
shown in Figure 17.4 is trained on a data set of bedroom images (Radford, Metz, and 
Chintala, 2015) and random samples from the latent space are propagated through 
the trained network, the generated images also look like bedrooms, as expected. In 
addition, however, the latent space has become organized in ways that are semanti- 
cally meaningful. For example, if we follow a smooth trajectory through the latent 
space and generate the corresponding series of images, we obtain smooth transitions 
from one image to the next, as seen in Figure 17.9. 

Moreover, it is possible to identify directions in latent space that correspond 
to semantically meaningful transformations. For example, for faces, one direction 
might correspond to changes in the orientation of the face, whereas other directions 
might correspond to changes in lighting or the degree to which the face is smiling or 
not. These are called disentangled representations and allow new images to be syn- 
thesized having specified properties. Figure 17.10 is an example from a GAN trained 
on face images, showing that semantic attributes such as gender or the presence of 
glasses correspond to particular directions in latent space. 
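Both operations described here, walking smoothly through latent space and doing arithmetic on averaged latent vectors, are simple to express. The sketch below works on latent vectors only; the function names are illustrative assumptions, and a trained generator (not included) would be needed to turn each resulting vector into an image.

```python
def interpolate(z_start, z_end, n_points):
    """Evenly spaced latent vectors on the straight line from z_start to z_end.

    Requires n_points >= 2 so that both end points are included.
    """
    path = []
    for k in range(n_points):
        t = k / (n_points - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

def mean_vector(vectors):
    """Component-wise mean of a list of latent vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def latent_arithmetic(group_a, group_b, group_c):
    """mean(group_a) - mean(group_b) + mean(group_c), computed in latent space."""
    a, b, c = mean_vector(group_a), mean_vector(group_b), mean_vector(group_c)
    return [ai - bi + ci for ai, bi, ci in zip(a, b, c)]

path = interpolate([0.0, 0.0], [1.0, 2.0], n_points=3)
combined = latent_arithmetic([[1.0, 0.0], [3.0, 0.0]], [[1.0, 1.0]], [[0.0, 5.0]])
```

Averaging several latent vectors before doing the arithmetic, rather than using a single vector per group, follows the procedure described for Figure 17.10.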
Figure 17.9 Samples generated by a deep convolutional GAN trained on images of bedrooms. Each
row is generated by taking a smooth walk through latent space between randomly generated
locations. We see smooth transitions, with each image plausibly looking like a bedroom. In the
bottom row, for example, we see a TV on the wall gradually morph into a window. [From Radford,
Metz, and Chintala (2015) with permission.]

Figure 17.10 An example of vector arithmetic in the latent space of a trained GAN. In each of
the three columns, the latent space vectors that generated these images are averaged and then
vector arithmetic is applied to the resulting mean vectors to create a new vector
corresponding to the central image in the 3 × 3 array on the right. Adding noise to this vector
generates another eight sample images. The four images on the bottom row show that the same
arithmetic applied directly in data space simply results in a blurred image due to
misalignment. [From Radford, Metz, and Chintala (2015) with permission.]
17.1 (⋆⋆⋆) We would like the GAN error function (17.6) to have the property that, given
sufficiently flexible neural networks, the stationary point is obtained when the generator
distribution matches the true data distribution. In this exercise we prove this
result for network models with infinite flexibility by optimizing over the full space
of probability distributions $p_{\mathrm{G}}(\mathbf{x})$ and over the full space of functions
$d(\mathbf{x})$ corresponding to the generative and discriminative networks, respectively.
Specifically, we assume that the discriminative model is optimized in an inner loop, giving
rise to an effective outer loop error function for the generative model. First, show that, in
the limit of an infinite number of data samples, the GAN error function (17.6) can be
rewritten in the form

$$E(p_{\mathrm{G}}, d) = -\int p_{\mathrm{Data}}(\mathbf{x}) \ln d(\mathbf{x}) \, \mathrm{d}\mathbf{x} - \int p_{\mathrm{G}}(\mathbf{x}) \ln\left(1 - d(\mathbf{x})\right) \mathrm{d}\mathbf{x} \tag{17.14}$$

where $p_{\mathrm{Data}}(\mathbf{x})$ is the fixed distribution of real data points. Now consider a
variational optimization over all functions $d(\mathbf{x})$. Show that, for a fixed generative
network, the solution for the discriminator $d(\mathbf{x})$ that minimizes E is given by

$$d^{\star}(\mathbf{x}) = \frac{p_{\mathrm{Data}}(\mathbf{x})}{p_{\mathrm{Data}}(\mathbf{x}) + p_{\mathrm{G}}(\mathbf{x})}. \tag{17.15}$$

Hence, show that the error function E can be written as a function of the generator
network $p_{\mathrm{G}}(\mathbf{x})$ in the form

$$C(p_{\mathrm{G}}) = -\int p_{\mathrm{Data}}(\mathbf{x}) \ln\left\{ \frac{p_{\mathrm{Data}}(\mathbf{x})}{p_{\mathrm{Data}}(\mathbf{x}) + p_{\mathrm{G}}(\mathbf{x})} \right\} \mathrm{d}\mathbf{x} - \int p_{\mathrm{G}}(\mathbf{x}) \ln\left\{ \frac{p_{\mathrm{G}}(\mathbf{x})}{p_{\mathrm{Data}}(\mathbf{x}) + p_{\mathrm{G}}(\mathbf{x})} \right\} \mathrm{d}\mathbf{x}. \tag{17.16}$$

Now show that this can be rewritten in the form

$$C(p_{\mathrm{G}}) = -\ln(4) + \mathrm{KL}\left( p_{\mathrm{Data}} \,\middle\|\, \frac{p_{\mathrm{Data}} + p_{\mathrm{G}}}{2} \right) + \mathrm{KL}\left( p_{\mathrm{G}} \,\middle\|\, \frac{p_{\mathrm{Data}} + p_{\mathrm{G}}}{2} \right) \tag{17.17}$$

where the Kullback-Leibler divergence $\mathrm{KL}(p\|q)$ is defined by (2.100). Finally, using
the property that $\mathrm{KL}(p\|q) \geqslant 0$ with equality if, and only if, p(x) = q(x) for
all x, show that the minimum of $C(p_{\mathrm{G}})$ occurs when
$p_{\mathrm{G}}(\mathbf{x}) = p_{\mathrm{Data}}(\mathbf{x})$. Note
that the sum of the two Kullback-Leibler divergence terms in (17.17) is known as
the Jensen-Shannon divergence between $p_{\mathrm{Data}}$ and $p_{\mathrm{G}}$. Like the
Kullback-Leibler divergence, this is a non-negative quantity that vanishes if, and only if,
the two distributions are equal, but unlike the KL divergence, it is symmetric with respect to
the two distributions.

17.2 (⋆⋆⋆) In this exercise we explore the problems that can arise from the adversarial
nature of GAN training. Consider a cost function E(a, b) = ab defined over two
parameters a and b, analogous to the parameters of a generative and discriminative
network, respectively. Show that the point a = 0, b = 0 is a stationary point of
the cost function. By considering the second derivatives along the lines b = a and
b = −a show that the point a = 0, b = 0 is a saddle point. Now suppose that we
optimize this error function by taking infinitesimal steps, so that the variables become
functions of continuous time a(t), b(t) defined by a continuous-time gradient
descent, in which the parameter a(t) of the generative network is updated so as to
increase E(a, b), whereas the parameter b(t) is updated so as to decrease E(a, b).
Show that the evolution of the parameters is governed by the equations

$$\frac{\mathrm{d}a}{\mathrm{d}t} = \eta \frac{\partial E}{\partial a}, \qquad \frac{\mathrm{d}b}{\mathrm{d}t} = -\eta \frac{\partial E}{\partial b}. \tag{17.18}$$

Hence, show that a(t) satisfies the second-order differential equation

$$\frac{\mathrm{d}^2 a}{\mathrm{d}t^2} = -\eta^2 a(t). \tag{17.19}$$

Verify that the following expression is a solution of (17.19):

$$a(t) = C \cos(\eta t) + D \sin(\eta t) \tag{17.20}$$

where C and D are arbitrary constants. If the system is initialized at t = 0 with the
values a = 1, b = 0, find the values of C and D and hence show that the resulting
values of a(t) and b(t) trace out a circle of unit radius in a, b space centred on the
origin, and that they therefore never converge to the saddle point.

17.3 (⋆) Consider a GAN in which the training set consists of equal numbers of cat and
dog images and in which the generator network has learned to produce high quality
images of dogs. Show that, when presented with a dog image, the optimal output for
the discriminator network (trained to generate the probability that the image is real)
is 1/3.
18 Normalizing Flows


We have seen how generative adversarial networks (GANs) extend the framework 
of linear latent-variable models by using deep neural networks to represent highly 
flexible and learnable nonlinear transformations from the latent space to the data 
space. However, the likelihood function is generally either intractable, because the 
network function cannot be inverted, or may not even be defined if the latent space 
has a lower dimensionality than the data space. In GANs, a second, discriminative 
network was therefore introduced to facilitate adversarial training. 

Here we discuss the second of our four approaches to training nonlinear latent
variable models, which involves restricting the form of the neural network model such
that the likelihood function can be evaluated without approximation while still
ensuring that sampling from the trained model is straightforward. Suppose we define
a distribution p_z(z), sometimes also called a base distribution, over a latent variable
z along with a nonlinear function x = f(z, w), given by a deep neural network, that
transforms the latent space into the data space. Assuming p_z(z) is a simple
distribution such as a Gaussian, sampling from such a model is easy as each latent sample
z* ~ p_z(z) is simply passed through the neural network to generate a corresponding
data sample x* = f(z*, w).

To calculate the likelihood function for this model, we need the data-space dis- 
tribution, which depends on the inverse of the neural network function. We write 
this as z = g(x, w), and it satisfies z = g(f (z, w), w). This requires that, for every 
value of w, the functions f(z, w) and g(x, w) are invertible, also called bijective, so 
that each value of x corresponds to a unique value of z and vice versa. We can then 
use the change of variables formula to calculate the data density: 


p_x(x | w) = p_z(g(x, w)) \, |\det J(x)|    (18.1)
where J(x) is the Jacobian matrix of partial derivatives whose elements are given by 


J_{ij}(x) = \frac{\partial g_i(x, w)}{\partial x_j}    (18.2)

and |-| denotes the modulus or absolute value. We will continue to refer to z as a ‘la- 
tent’ variable even though the deterministic mapping means that any given data value 
x corresponds to a unique value of z whose value is therefore no longer uncertain. 
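
The change of variables formula (18.1) can be checked numerically in one dimension. The following is a minimal sketch under assumed choices that are not taken from the text: a standard normal base density and the invertible map x = exp(z), for which the inverse is g(x) = ln x and the resulting data density should be the log-normal density. The function names are illustrative only.

```python
import math
import random

def base_density(z):
    """Standard normal base density p_z(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def f(z):
    """Forward map from latent space to data space (assumed: x = exp(z))."""
    return math.exp(z)

def g(x):
    """Inverse map z = g(x) = ln(x)."""
    return math.log(x)

def data_density(x, h=1e-6):
    """p_x(x) = p_z(g(x)) |dg/dx|, with the 1 x 1 Jacobian estimated numerically."""
    jacobian = (g(x + h) - g(x - h)) / (2.0 * h)
    return base_density(g(x)) * abs(jacobian)

def lognormal_density(x):
    """Closed-form density of exp(z) for z ~ N(0, 1), used only as a check."""
    return math.exp(-0.5 * math.log(x) ** 2) / (x * math.sqrt(2.0 * math.pi))

# Sampling: draw z* from the base distribution and pass it through f.
random.seed(0)
samples = [f(random.gauss(0.0, 1.0)) for _ in range(5)]

for x in (0.5, 1.0, 2.0):
    print(x, data_density(x), lognormal_density(x))
```

The two printed densities should agree to within the accuracy of the finite-difference Jacobian, while sampling requires only the forward map.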

The mapping function f(z, w) will be defined in terms of a special form of neu- 
ral network, whose structure we will discuss shortly. One consequence of requiring 
an invertible mapping is that the dimensionality of the latent space must be the same 
as that of the data space, which can lead to large models for high-dimensional data 
such as images. Also, in general, the cost of evaluating the determinant of a D × D
matrix is O(D^3), so we will seek to impose some further restrictions on the model
in order that evaluation of the Jacobian matrix determinant is more efficient. 

If we consider a training set D = {x_1, ..., x_N} of independent data points, the
log likelihood function is given from (18.1) by 


\ln p(D | w) = \sum_{n=1}^{N} \ln p_x(x_n | w)    (18.3)

             = \sum_{n=1}^{N} \left\{ \ln p_z(g(x_n, w)) + \ln |\det J(x_n)| \right\}    (18.4)


and our goal is to use the likelihood function to train the neural network. To be able 
to model a wide range of distributions, we want the transformation function x = 
f(z, w) to be highly flexible, and so we use a deep neural network architecture. We 
can ensure that the overall function is invertible if we make each layer of the network 
invertible. To see this, consider three successive transformations, each corresponding 
to one layer, of the form: 

x = f^A(f^B(f^C(z))).    (18.5)


Then the inverse function is given by


z = g^C(g^B(g^A(x)))    (18.6)

where g^A, g^B, and g^C are the inverse functions of f^A, f^B, and f^C, respectively.
Moreover, the determinant of the Jacobian for such a layered structure is also easy 
to evaluate in terms of the Jacobian determinants for each of the individual layers by 
making use of the chain rule of calculus: 


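\frac{\partial z_i}{\partial x_j} = \sum_k \sum_l \frac{\partial g^C_i}{\partial g^B_k} \frac{\partial g^B_k}{\partial g^A_l} \frac{\partial g^A_l}{\partial x_j}.    (18.7)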


We recognize the right-hand side as the product of three matrices, and the determi- 
nant of a product is the product of the determinants. Therefore, the log determinant 
of the overall Jacobian will be the sum of the log determinants corresponding to each 
layer. 
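
As a small illustration of this additivity, the sketch below stacks three one-dimensional invertible layers and compares the sum of the per-layer log derivatives with a finite-difference estimate for the whole stack. The particular layers (two affine maps and a sinh) are assumptions made purely for the example.

```python
import math

# Each inverse layer returns its output together with ln|d(output)/d(input)|.
# The three forward layers f_A, f_B, f_C are illustrative choices only.

def g_A(x):
    """Inverse of f_A(v) = 0.5 * v - 3."""
    return 2.0 * (x + 3.0), math.log(2.0)

def g_B(u):
    """Inverse of f_B(v) = sinh(v)."""
    return math.asinh(u), -0.5 * math.log(1.0 + u * u)

def g_C(u):
    """Inverse of f_C(z) = 2 * z + 1."""
    return 0.5 * (u - 1.0), math.log(0.5)

def forward_flow(z):
    """x = f_A(f_B(f_C(z))), as in (18.5)."""
    return 0.5 * math.sinh(2.0 * z + 1.0) - 3.0

def inverse_flow(x):
    """z = g_C(g_B(g_A(x))), as in (18.6), accumulating per-layer log determinants."""
    value, total_log_det = x, 0.0
    for layer in (g_A, g_B, g_C):
        value, log_det = layer(value)
        total_log_det += log_det
    return value, total_log_det

x = forward_flow(0.3)
z, log_det = inverse_flow(x)

# Finite-difference estimate of ln|dz/dx| for the whole stack, used as a check.
h = 1e-6
slope = (inverse_flow(x + h)[0] - inverse_flow(x - h)[0]) / (2.0 * h)
print(z, log_det, math.log(abs(slope)))
```

In one dimension each Jacobian is a single number, so the product of determinants reduces to a product of slopes; the same bookkeeping carries over unchanged to the multivariate case.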

This approach to modelling a flexible distribution is called a normalizing flow 
because the transformation of a probability distribution through a sequence of map- 
pings is somewhat analogous to the flow of a fluid. Also, the effect of the inverse 
mapping is to transform the complex data distribution into a normalized form, typ- 
ically a Gaussian or normal distribution. Normalizing flows have been reviewed by 
Kobyzev, Prince, and Brubaker (2019) and Papamakarios et al. (2019). Here we 
discuss the core concepts from the two main classes of normalizing flows used in 
practice: coupling flows and autoregressive flows. We also look at the use of neural 
differential equations to define invertible mappings, leading to continuous flows. 


Coupling Flows 


Our goal is to design a single invertible function layer, so that we can compose many 
of them together to define a highly flexible class of invertible functions. Consider 
first a linear transformation of the form 


x=az+b. (18.8) 


This is easy to invert, giving 
z = \frac{1}{a}(x - b).    (18.9)


However, linear transformations are closed under composition, meaning that a se- 
quence of linear transformations is equivalent to a single overall linear transforma- 
tion. Moreover, a linear transformation of a Gaussian distribution is again Gaussian. 
So even if we have many such ‘layers’ of linear transformation, we will only ever 
have a Gaussian distribution. The question is whether we can retain the invertibility
of a linear transformation while allowing additional flexibility so that the resulting 
distribution can be non-Gaussian. 
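
The closure of linear transformations under composition is easy to confirm directly. In the sketch below, which uses arbitrary illustrative coefficients, a stack of scalar layers x = az + b is collapsed into a single pair (a, b); a standard normal input is therefore mapped to a Gaussian with mean b and variance a^2, however many layers are used.

```python
def compose(layers):
    """Collapse a list of (a, b) pairs, applied first to last, into one (a, b)."""
    a_total, b_total = 1.0, 0.0
    for a, b in layers:
        # Applying x -> a * x + b after x -> a_total * x + b_total.
        a_total, b_total = a * a_total, a * b_total + b
    return a_total, b_total

layers = [(2.0, 1.0), (-0.5, 3.0), (4.0, -2.0)]   # illustrative coefficients
a, b = compose(layers)

# Push one value through the layers one at a time and compare.
z = 0.7
x = z
for a_k, b_k in layers:
    x = a_k * x + b_k
print(x, a * z + b)
```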

Figure 18.1 A single layer of the real NVP normalizing flow model. Here the network
NN1 computes the function exp(s(z_A, w)) and the network NN2 computes the function
b(z_A, w). The output vector is then defined by (18.10) and (18.11).

One solution to this problem is given by a form of normalizing flow model called
real NVP (Dinh, Krueger, and Bengio, 2014; Dinh, Sohl-Dickstein, and Bengio,
2016), which is short for ‘real-valued non-volume-preserving’. The idea is to
partition the latent-variable vector z into two parts z = (z_A, z_B), so that if z has
dimension D and z_A has dimension d, then z_B has dimension D - d. We similarly
partition the output vector x = (x_A, x_B) where x_A has dimension d and x_B has
dimension D - d. For the first part of the output vector, we simply copy the input:


x_A = z_A.    (18.10)


The second part of the vector undergoes a linear transformation, but now the
coefficients in the linear transformation are given by nonlinear functions of z_A:


x_B = \exp(s(z_A, w)) \odot z_B + b(z_A, w)    (18.11)


where s(z_A, w) and b(z_A, w) are the real-valued outputs of neural networks, and
the exponential ensures that the multiplicative term is strictly positive. Here ⊙ denotes
the Hadamard product involving an element-wise multiplication of the two vectors.
Similarly, the exponential in (18.11) is taken element-wise. Note that we have shown
the same vector w in both network functions. In practice, these may be implemented 
as separate networks with their own parameters, or as one network with two sets of 
outputs. 

Due to the use of neural network functions, the value of x_B can be a very flexible
function of x_A. Nevertheless, the overall transformation is easily invertible: given a
value for x = (x_A, x_B) we first compute


z_A = x_A,    (18.12)

then we evaluate s(z_A, w) and b(z_A, w), and finally we compute z_B using

z_B = \exp(-s(z_A, w)) \odot (x_B - b(z_A, w)).    (18.13)


The overall transformation is illustrated in Figure 18.1. Note that there is no
requirement for the individual neural network functions s(z_A, w) and b(z_A, w) to be
invertible.
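
A single coupling layer of this kind can be sketched in a few lines. In the example below the functions s_net and b_net are hypothetical stand-ins for the two neural networks (any functions of z_A would do, invertible or not), and the dimensions D = 4 and d = 2 are chosen only for illustration.

```python
import math

def s_net(z_a):
    """Stand-in for the network s(z_A, w); it need not be invertible."""
    return [math.tanh(z_a[0] - z_a[1]), 0.5 * math.sin(z_a[0])]

def b_net(z_a):
    """Stand-in for the network b(z_A, w); it need not be invertible."""
    return [z_a[0] * z_a[1], z_a[1] ** 2 - 1.0]

def forward(z, d=2):
    """Equations (18.10) and (18.11): copy z_A, transform z_B element-wise."""
    z_a, z_b = z[:d], z[d:]
    s, b = s_net(z_a), b_net(z_a)
    x_b = [math.exp(s_i) * z_i + b_i for s_i, z_i, b_i in zip(s, z_b, b)]
    return z_a + x_b

def inverse(x, d=2):
    """Equations (18.12) and (18.13): z_A = x_A, so the networks see the same input."""
    x_a, x_b = x[:d], x[d:]
    s, b = s_net(x_a), b_net(x_a)
    z_b = [math.exp(-s_i) * (x_i - b_i) for s_i, x_i, b_i in zip(s, x_b, b)]
    return x_a + z_b

z = [0.3, -1.2, 0.8, 2.0]
x = forward(z)
print(x)
print(inverse(x))   # recovers z up to rounding error
```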

Now consider the evaluation of the Jacobian defined by (18.2) and its determi- 
nant. We can divide the Jacobian matrix into blocks, corresponding to the partition- 
ing of z and x, giving 


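J = \begin{pmatrix} I_d & 0 \\ \frac{\partial z_B}{\partial x_A} & \mathrm{diag}(\exp(-s)) \end{pmatrix}.    (18.14)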
Figure 18.2 By composing two layers of the form shown in Figure 18.1, we obtain a more flexible, but still
invertible, nonlinear layer. Each sub-layer is invertible and has an easily evaluated Jacobian, and hence the
overall double layer has the same properties.


The top left block corresponds to the derivatives of z_A with respect to x_A and hence
from (18.12) is given by the d × d identity matrix. The top right block corresponds to
the derivatives of z_A with respect to x_B and these terms vanish, again from (18.12).
The bottom left block corresponds to the derivatives of z_B with respect to x_A. From
(18.13), these are complicated expressions involving the neural network functions.
Finally, the bottom right block corresponds to the derivatives of z_B with respect to
x_B, which from (18.13) are given by a diagonal matrix whose diagonal elements
are given by the exponentials of the negative elements of s(z_A, w). We therefore
see that the Jacobian matrix (18.14) is a lower triangular matrix, meaning that all
elements above the leading diagonal are zero. For such a matrix, the determinant
is just the product of the elements along the leading diagonal (see Appendix A), and
therefore it does not depend on the complicated expressions in the lower left block.
Consequently, the determinant of the Jacobian is simply given by the product of the
elements of exp(-s(z_A, w)).
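
This result can be verified numerically for the smallest non-trivial case, D = 2 and d = 1. The sketch below builds the 2 × 2 Jacobian of the inverse map by central differences and compares its determinant with exp(-s(z_A)). The scalar functions s and b are hypothetical stand-ins for the networks.

```python
import math

def s(z_a):
    """Hypothetical scale network."""
    return math.tanh(2.0 * z_a)

def b(z_a):
    """Hypothetical shift network."""
    return z_a ** 3

def inverse(x):
    """z = g(x) for one coupling layer, (18.12) and (18.13), with D = 2 and d = 1."""
    x_a, x_b = x
    return [x_a, math.exp(-s(x_a)) * (x_b - b(x_a))]

def numerical_jacobian(x, h=1e-6):
    """J[i][j] = dz_i/dx_j estimated by central differences."""
    jac = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        z_plus, z_minus = inverse(x_plus), inverse(x_minus)
        for i in range(2):
            jac[i][j] = (z_plus[i] - z_minus[i]) / (2.0 * h)
    return jac

x = [0.4, -1.5]
J = numerical_jacobian(x)
det_numeric = J[0][0] * J[1][1] - J[0][1] * J[1][0]
det_formula = math.exp(-s(x[0]))
print(J)
print(det_numeric, det_formula)
```

The lower left element J[1][0] is non-zero and depends on the derivatives of s and b, but it is multiplied by the vanishing upper right element and so drops out of the determinant.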

A clear limitation of this approach is that the value of z_A is unchanged by the
transformation. This is easily resolved by adding another layer in which the roles
of z_A and z_B are reversed, as illustrated in Figure 18.2. This double-layer structure
can then be repeated multiple times to facilitate a very flexible class of generative 
models. 

The overall training procedure involves creating mini-batches of data points, in 
which the contribution of each data point to the log likelihood function is obtained 
from (18.4). For a latent distribution of the form N(z|0, I), the log density is simply
-||z||^2/2 up to an additive constant. The inverse transformation z = g(x) is
calculated using a sequence of inverse transformations of the form (18.13). Similarly,
the log of the Jacobian determinant is given by a sum of log determinants for each
layer where each term is itself a sum of terms of the form -s_i(x, w). Gradients of
the log likelihood can be evaluated using automatic differentiation, and the network 
parameters updated by stochastic gradient descent. 
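
The contribution of one data point to the log likelihood can be sketched as follows for a two-dimensional flow built from alternating coupling layers. The scalar 'networks' s and b, each controlled by a single weight, are assumptions made to keep the example short; in practice they would be neural networks and the gradients would be obtained by automatic differentiation.

```python
import math

def s(u, w):
    """Hypothetical scale network with a single weight w[0]."""
    return w[0] * math.tanh(u)

def b(u, w):
    """Hypothetical shift network with a single weight w[1]."""
    return w[1] * u

def inverse_layer(x, w, flip):
    """Invert one coupling layer on a 2-vector; `flip` swaps the roles of the parts."""
    a, c = (x[1], x[0]) if flip else (x[0], x[1])   # a is copied, c is transformed
    scale = s(a, w)
    c = math.exp(-scale) * (c - b(a, w))            # equation (18.13)
    z = [c, a] if flip else [a, c]
    return z, -scale                                # ln|det J| for this layer

def log_likelihood(x, weights):
    """ln p_x(x|w) for a single data point, following (18.4)."""
    log_det = 0.0
    z = list(x)
    # Layers are inverted in the reverse of the order used for generation.
    for k in reversed(range(len(weights))):
        z, term = inverse_layer(z, weights[k], flip=(k % 2 == 1))
        log_det += term
    # Log density of N(z|0, I) in two dimensions, including the constant.
    log_base = -0.5 * sum(v * v for v in z) - math.log(2.0 * math.pi)
    return log_base + log_det

weights = [(0.5, 0.3), (-0.4, 0.2)]   # (scale weight, shift weight) for each layer
print(log_likelihood([0.2, -1.0], weights))
```

Summing this quantity over a mini-batch gives the objective whose gradient with respect to the weights drives stochastic gradient descent.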

The real NVP model belongs to a broad class of normalizing flows called coupling
flows, in which the linear transformation (18.11) is replaced by a more general
form:

x_B = h(z_B, g(z_A, w))    (18.15)

where h(z_B, g) is a function of z_B that is efficiently invertible for any given value of
g and is called the coupling function. The function g(z_A, w) is called a conditioner
and is typically represented by a neural network.

Figure 18.3 Illustration of the real NVP normalizing flow model applied to the two-moons
data set showing (a) the Gaussian base distribution, (b) the distribution after a
transformation of the vertical axis only, (c) the distribution after a subsequent
transformation of the horizontal axis, (d) the distribution after a second transformation
of the vertical axis, (e) the distribution after a second transformation of the horizontal
axis, and (f) the data set on which the model was trained.

We can illustrate the real NVP normalizing flow using a simple data set, some- 
times known as ‘two moons’, as shown in Figure 18.3. Here a two-dimensional 
Gaussian distribution is transformed into a more complex distribution by using two 
successive layers each of which consists of alternate transformations on each of the 
two dimensions. 


Autoregressive Flows 


Figure 18.4 Illustration of two alternative structures for autoregressive normalizing
flows. The masked autoregressive flow shown in (a) allows efficient evaluation of the
likelihood function, whereas the alternative inverse autoregressive flow shown in (b)
allows for efficient sampling.

A related formulation of normalizing flows can be motivated by noting that the joint
distribution over a set of variables can always be written as the product of conditional
distributions, one for each variable. We first choose an ordering of the variables in
the vector x, from which we can write, without loss of generality,


p(x_1, \ldots, x_D) = \prod_{i=1}^{D} p(x_i | x_{1:i-1})    (18.16)
where x_{1:i-1} denotes x_1, ..., x_{i-1}. This factorization can be used to construct a
class of normalizing flow called a masked autoregressive flow, or MAF (Papamakarios,
Pavlakou, and Murray, 2017), given by


x_i = h(z_i, g_i(x_{1:i-1}, w_i))    (18.17)


which is illustrated in Figure 18.4(a). Here h(z_i, ·) is the coupling function, which
is chosen to be easily invertible with respect to z_i, and g_i is the conditioner, which
is typically represented by a deep neural network. The term masked refers to the use
of a single neural network to implement a set of equations of the form (18.17) along
with a binary mask (Germain et al., 2015) that forces a subset of the network weights
to be zero to implement the autoregressive constraint (18.16).
In this case the reverse calculations needed to evaluate the likelihood function 
are given by 
z_i = h^{-1}(x_i, g_i(x_{1:i-1}, w_i))    (18.18)


and hence can be performed efficiently on modern hardware since the individual
functions in (18.18) needed to evaluate z_1, ..., z_D can be evaluated in parallel. The
Jacobian matrix corresponding to the set of transformations (18.18) has elements
∂z_i/∂x_j, which vanish for j > i because z_i depends only on x_1, ..., x_i. The
Jacobian is therefore a lower-triangular matrix whose determinant is given by the
product of the diagonal elements and can therefore also be evaluated efficiently.
However, sampling from this model must be done by evaluating (18.17), which is
intrinsically sequential and therefore slow because the values of x_1, ..., x_{i-1} must
be evaluated before x_i can be computed.
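
The asymmetry between density evaluation and sampling can be seen in a minimal sketch of an affine autoregressive layer. The coupling function is assumed here to be x_i = exp(α_i) z_i + μ_i, and the conditioner is a hypothetical hand-written function rather than a masked neural network.

```python
import math

def conditioner(prefix):
    """Hypothetical g_i: maps x_{1:i-1} to a (shift, log-scale) pair."""
    total = sum(prefix)
    return 0.5 * total, 0.3 * math.tanh(total)

def to_latent(x):
    """Equation (18.18): every z_i depends only on the known x, so the loop
    iterations are independent of one another and could run in parallel."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        shift, log_scale = conditioner(x[:i])
        z.append((x[i] - shift) * math.exp(-log_scale))
        log_det -= log_scale            # ln of the diagonal element dz_i/dx_i
    return z, log_det

def sample(z):
    """Equation (18.17): x_i needs x_1, ..., x_{i-1}, so this loop is sequential."""
    x = []
    for i in range(len(z)):
        shift, log_scale = conditioner(x)
        x.append(math.exp(log_scale) * z[i] + shift)
    return x

z = [0.2, -1.0, 0.5]
x = sample(z)
print(x)
print(to_latent(x))   # recovers z together with the log Jacobian determinant
```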

To avoid this inefficient sampling, we can instead define an inverse autoregressive
flow, or IAF (Kingma et al., 2016), given by


x_i = h(z_i, \tilde{g}_i(z_{1:i-1}, w_i))    (18.19)
as illustrated in Figure 18.4(b). Sampling is now efficient since, for a given choice
of z, the evaluation of the elements x_1, ..., x_D using (18.19) can be performed in
parallel. However, the inverse function, which is needed to evaluate the likelihood,
requires a series of calculations of the form


z_i = h^{-1}(x_i, \tilde{g}_i(z_{1:i-1}, w_i)),    (18.20)


which are intrinsically sequential and therefore slow. Whether a masked autoregres- 
sive flow or an inverse autoregressive flow is preferred will depend on the specific 
application. 
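
For comparison, the same affine coupling function can be conditioned on the latent variables instead, which reverses the roles of the two loops. As before, this is only a sketch and the conditioner is a hypothetical hand-written function.

```python
import math

def conditioner(prefix):
    """Hypothetical conditioner: maps z_{1:i-1} to a (shift, log-scale) pair."""
    total = sum(prefix)
    return 0.5 * total, 0.3 * math.tanh(total)

def sample(z):
    """Equation (18.19): each x_i uses only the known z, so iterations are independent."""
    x = []
    for i in range(len(z)):
        shift, log_scale = conditioner(z[:i])
        x.append(math.exp(log_scale) * z[i] + shift)
    return x

def to_latent(x):
    """Equation (18.20): z_i needs z_1, ..., z_{i-1}, so the inverse is sequential."""
    z = []
    for i in range(len(x)):
        shift, log_scale = conditioner(z)
        z.append((x[i] - shift) * math.exp(-log_scale))
    return z

z = [0.2, -1.0, 0.5]
x = sample(z)
print(x)
print(to_latent(x))   # recovers z, but only one element at a time
```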

We see that coupling flows and autoregressive flows are closely related. Al- 
though autoregressive flows introduce considerable flexibility, this comes with a 
computational cost that grows linearly in the dimensionality D of the data space 
due to the need for sequential ancestral sampling. Coupling flows can be viewed as 
a special case of autoregressive flows in which some of this generality is sacrificed 
for efficiency by dividing the variables into two groups instead of D groups. 


Continuous Flows 


The final approach to normalizing flows that we consider in this chapter will make 
use of deep neural networks defined in terms of an ordinary differential equation, or 
ODE. This can be thought of as a deep network with an infinite number of layers. 
We first introduce the concept of a neural ODE and then see how this can be applied
to the formulation of a normalizing flow model. 


18.3.1 Neural differential equations 


We have seen that neural networks are especially useful when they comprise 
many layers of processing, and so we can ask what happens if we explore the limit 
of an infinitely large number of layers. Consider a residual network where each layer 
of processing generates an output given by the input vector with the addition of some 
parameterized nonlinear function of that input vector: 


z^{(t+1)} = z^{(t)} + f(z^{(t)}, w)    (18.21)


where t = 1, ..., T labels the layers in the network. Note that we have used the
same function at each layer, with a shared parameter vector w, because this allows 
us to consider an arbitrarily large number of such layers while keeping the number 
of parameters bounded. Imagine that we increase the number of layers while ensur- 
ing that the changes introduced at each layer become correspondingly smaller. In 
the limit, the hidden-unit activation vector becomes a function z(t) of a continuous 
variable t, and we can express the evolution of this vector through the network as a 
differential equation: 

\frac{dz(t)}{dt} = f(z(t), w)    (18.22)

where t is often referred to as ‘time’. The formulation in (18.22) is known as a neural
ordinary differential equation or neural ODE (Chen et al., 2018). Here ‘ordinary’
means that there is a single variable t. If we denote the input to the network by
the vector z(0), then the output z(T) is obtained by integration of the differential
equation

z(T) = z(0) + \int_0^T f(z(t), w) \, dt.    (18.23)

This integral can be evaluated using standard numerical integration packages. The
simplest method for solving differential equations is Euler’s forward integration
method, which corresponds to the expression (18.21). In practice, more powerful
numerical integration algorithms can adapt their function evaluations to achieve a
desired level of accuracy. In particular, they can adaptively choose values of t that
typically are not uniformly spaced. The number of such evaluations replaces the
concept of depth in a conventional layered network. A comparison of a standard
layered neural network and a neural differential equation is shown in Figure 18.5.

Figure 18.5 Comparison of a conventional layered network with a neural differential
equation. The diagram on the left corresponds to a residual network with five layers
and shows trajectories for several starting values of a single scalar input. The diagram
on the right shows the result of numerical integration of a continuous neural ODE, again
for several starting values of the scalar input, in which we see that the function is not
evaluated at uniformly-spaced time intervals, but instead the evaluation points are chosen
adaptively by the numerical solver and depend on the choice of input value. [From Chen
et al. (2018) with permission.]
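
The correspondence between Euler's method and a residual network can be sketched numerically. The example below assumes a linear vector field f(z, w) = -wz in one dimension, chosen only because the exact solution z(T) = z(0) exp(-wT) is then available for comparison; each Euler step plays the role of one residual layer with its update scaled by the step size.

```python
import math

def f(z, w):
    """Hypothetical stand-in for the network: a linear vector field in one dimension."""
    return -w * z

def euler(z0, w, T, steps):
    """Forward Euler: each step is a residual layer z <- z + eps * f(z, w)."""
    eps = T / steps
    z = z0
    for _ in range(steps):
        z = z + eps * f(z, w)
    return z

z0, w, T = 1.5, 0.8, 2.0
exact = z0 * math.exp(-w * T)
for steps in (2, 20, 200, 2000):
    print(steps, euler(z0, w, T, steps), exact)
```

As the number of steps grows, the output of the 'network' approaches the exact solution of the differential equation. An adaptive solver would reach a given accuracy with far fewer function evaluations than this fixed-step scheme.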


18.3.2 Neural ODE backpropagation 


We now need to address the challenge of how to train a neural ODE, that is, how
to determine the value of w by optimizing a loss function. Let us assume that we are 
given a data set comprising values of the input vector z(0) along with an associated 
output target vector and a loss function L(-) that depends on the output vector z(T). 
One approach would be to use automatic differentiation to differentiate through all 
of the operations performed by the ODE solver during the forward pass. Although 
this is straightforward to do, it is costly from a memory perspective and is not
optimal in terms of controlling numerical error. Instead, Chen et al. (2018) treat the
ODE solver as a black box and use a technique called the adjoint sensitivity method, 
which can be viewed as the continuous analogue of explicit backpropagation. Recall 
that backpropagation involves, for each data point, three successive phases: first a 
forward propagation to evaluate the activation vectors at each layer of the network, 
second the evaluation of the derivatives of the loss with respect to the activations at 
each layer starting at the output and propagating backwards through the network by 
exploiting the chain rule of calculus, and third the evaluation of the derivatives with 
respect to network parameters by forming products of activations from the forward 
pass and gradients from the backward pass. We will see that there are analogous 
steps when computing the gradients for a neural ODE. 
To apply backpropagation to neural ODEs, we define a quantity called the ad- 
joint given by 
a(t) = \frac{dL}{dz(t)}.    (18.24)

We see that a(T) corresponds to the usual derivative of the loss with respect to the
output vector. The adjoint satisfies its own differential equation given by

\frac{da(t)}{dt} = -a(t)^{\mathrm{T}} \nabla_z f(z(t), w),    (18.25)


which is a continuous version of the chain rule of calculus. This can be solved by 
integrating backwards starting from a(T), which again can be done using a black- 
box ODE solver. In principle, this requires that we have stored the trajectory z(t) 
computed during the forward phase, which could be problematic as the inverse solver 
might wish to evaluate z(t) at different values of t compared to the forward solver. 
Instead we simply allow the backwards solver to recompute any required values of 
z(t) by integrating (18.22) alongside (18.25) starting with the output value z(T). 

The third step in the backpropagation method is to evaluate derivatives of the loss 
with respect to network parameters by forming appropriate products of activations 
and gradients. When a parameter value is shared across multiple connections in a 
network, the total derivative is formed from the sum of derivatives for each of the 
connections. For our neural ODE, in which the same parameter vector w is shared 
throughout the network, this summation becomes an integration over t, which takes 
the form 


\nabla_w L = -\int_T^0 a(t)^{\mathrm{T}} \nabla_w f(z(t), w) \, dt.    (18.26)


The derivatives ∇_z f in (18.25) and ∇_w f in (18.26) can be evaluated efficiently
using automatic differentiation. Note that the above results can equally be applied to 
a more general neural network function f (z(t), t, w) that has an explicit dependence 
on t in addition to the implicit dependence through z(t). 
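
The three phases can be illustrated with a scalar example in which everything is available in closed form. The sketch below assumes the vector field f(z, w) = wz and the loss L = z(T)^2/2, for which z(T) = z(0) exp(wT) and dL/dw = T z(T)^2. A fixed-step Euler scheme stands in for the black-box solver, and the derivatives of f are written by hand rather than obtained by automatic differentiation.

```python
import math

def f(z, w):
    """Hypothetical scalar 'network'."""
    return w * z

def df_dz(z, w):
    return w

def df_dw(z, w):
    return z

def forward(z0, w, T, steps):
    """Forward pass: integrate (18.22) from t = 0 to t = T without storing z(t)."""
    eps = T / steps
    z = z0
    for _ in range(steps):
        z += eps * f(z, w)
    return z

def adjoint_gradient(z_T, w, T, steps):
    """Backward pass: integrate z, a and the gradient from t = T down to t = 0."""
    eps = T / steps
    z = z_T
    a = z_T                 # a(T) = dL/dz(T) for L = z(T)^2 / 2
    grad = 0.0
    for _ in range(steps):
        dz = f(z, w)                    # (18.22), used to recompute z(t)
        da = -a * df_dz(z, w)           # (18.25)
        dg = -a * df_dw(z, w)           # minus the product a(t) * df/dw
        # A backward step of size eps subtracts eps times each time derivative.
        z, a, grad = z - eps * dz, a - eps * da, grad - eps * dg
    return grad

z0, w, T, steps = 1.0, 0.5, 1.0, 10000
z_T = forward(z0, w, T, steps)
grad = adjoint_gradient(z_T, w, T, steps)
exact = T * (z0 * math.exp(w * T)) ** 2     # closed form for this linear example
print(grad, exact)
```

Only the current values of z, a, and the accumulated gradient are held in memory during the backward pass, which is the source of the constant memory cost.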

One benefit of neural ODEs trained using the adjoint method, compared to con- 
ventional layered networks, is that there is no need to store the intermediate results 
of the forward propagation, and hence the memory cost is constant. Furthermore, 
neural ODEs can naturally handle continuous-time data in which observations occur
at arbitrary times. If the error function L depends on values of z(t) other than the
output value, then multiple runs of the reverse-mode solver are required, with one
run for each consecutive pair of outputs, so that the single solution is broken down 
into multiple consecutive solutions in order to access the intermediate states (Chen 
et al., 2018). Note that a high level of accuracy in the solver can be used during train- 
ing, with a lower accuracy, and hence fewer function evaluations, during inference 
in applications for which compute resources are limited. 


18.3.3 Neural ODE flows 


We can make use of a neural ordinary differential equation to define an alter- 
native approach to the construction of tractable normalizing flow models. A neural 
ODE defines a highly flexible transformation from an input vector z(0) to an output 
vector z(T) in terms of a differential equation of the form 


\frac{dz(t)}{dt} = f(z(t), w).    (18.27)


If we define a base distribution over the input vector p(z(0)) then the neural ODE 
propagates this forward through time to give a distribution p(z(t)) for each value 
of t, leading to a distribution over the output vector p(z(T)). Chen et al. (2018) 
showed that for neural ODEs, the transformation of the density can be evaluated by 
integrating a differential equation given by 


\frac{d \ln p(z(t))}{dt} = -\mathrm{Tr}\left( \frac{\partial f}{\partial z(t)} \right)    (18.28)


where ∂f/∂z represents the Jacobian matrix with elements ∂f_i/∂z_j. This
integration can be performed using standard ODE solvers. Likewise, samples from this
density can be obtained by sampling from the base density p(z(0)), which is chosen 
to be a simple distribution such as a Gaussian, and propagating the values to the out- 
put by integrating (18.27) again using the ODE solver. The resulting framework is 
known as a continuous normalizing flow and is illustrated in Figure 18.6. Continuous 
normalizing flows can be trained using the adjoint sensitivity method used for neural 
ODEs, which can be viewed as the continuous time equivalent of backpropagation. 
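
A one-dimensional example shows how (18.27) and (18.28) are integrated together. The sketch below assumes the linear vector field f(z, w) = wz, chosen because a standard normal base density is then transported to a Gaussian with variance exp(2wT), giving an exact log density against which the result can be compared. Fixed-step Euler integration stands in for a proper ODE solver.

```python
import math

def f(z, w):
    """Hypothetical one-dimensional vector field."""
    return w * z

def trace_jacobian(z, w):
    """df/dz; in one dimension the trace is this single element."""
    return w

def flow(z0, w, T, steps=10000):
    """Euler integration of (18.27) together with (18.28)."""
    eps = T / steps
    z = z0
    log_p = -0.5 * z0 * z0 - 0.5 * math.log(2.0 * math.pi)   # ln N(z0|0, 1)
    for _ in range(steps):
        log_p -= eps * trace_jacobian(z, w)
        z += eps * f(z, w)
    return z, log_p

z0, w, T = 0.7, 0.4, 1.5
z_T, log_p = flow(z0, w, T)

# Exact log density of N(0, exp(2wT)) evaluated at the transported point.
var = math.exp(2.0 * w * T)
exact = -0.5 * z_T * z_T / var - 0.5 * math.log(2.0 * math.pi * var)
print(z_T, log_p, exact)
```

Here w > 0, so neighbouring trajectories spread apart and the log density decreases along the flow, in line with the sign of the trace term.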
Figure 18.6 Illustration of a continuous normalizing flow showing a simple
Gaussian distribution at t = 0 that is continuously transformed into a multimodal
distribution at t = T. The flow lines show how points along the z-axis
evolve as a function of t. Where the flow lines spread apart the density is reduced,
and where they move together the density is increased.

Since (18.28) involves the trace of the Jacobian rather than the determinant,
which arises in discrete normalizing flows, it might appear to be more computationally
efficient. In general, evaluating the determinant of a D × D matrix requires
O(D^3) operations, whereas evaluating the trace requires O(D) operations. However,
if the Jacobian matrix is lower triangular, as in many forms of normalizing flow,
then the determinant is the product of the diagonal terms and therefore also involves
O(D) operations. Since evaluating the individual elements of the Jacobian matrix
requires a separate forward propagation, which itself requires O(D) operations,
evaluating the trace or the determinant (for a lower triangular matrix) takes O(D^2)
operations overall. However, the cost of evaluating the trace can be reduced to O(D)
by using Hutchinson’s trace estimator (Grathwohl et al., 2018), which for a matrix
A takes the form


\mathrm{Tr}(A) = \mathbb{E}_{\epsilon}\left[ \epsilon^{\mathrm{T}} A \epsilon \right]    (18.29)


where ε is a random vector whose distribution has zero mean and unit covariance, for
example, a Gaussian N(0, I). For a specific ε, the matrix-vector product Aε can be
evaluated efficiently in a single pass using reverse-mode automatic differentiation.
We can then approximate the trace using a finite number of samples in the form 


\mathrm{Tr}(A) \approx \frac{1}{M} \sum_{m=1}^{M} \epsilon_m^{\mathrm{T}} A \epsilon_m.    (18.30)
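
The estimator (18.30) is simple to sketch. In the example below the matrix A is a small explicit matrix chosen for illustration, whereas in a continuous normalizing flow A would be the Jacobian ∂f/∂z and would never be formed explicitly; only the products with ε would be computed.

```python
import random

def quadratic_form(A, eps):
    """eps^T A eps for a square matrix stored as a list of rows."""
    n = len(A)
    return sum(eps[i] * A[i][j] * eps[j] for i in range(n) for j in range(n))

def hutchinson_trace(A, num_samples, rng):
    """Monte Carlo estimate (18.30) using Gaussian probe vectors eps ~ N(0, I)."""
    n = len(A)
    total = 0.0
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += quadratic_form(A, eps)
    return total / num_samples

# Illustrative matrix with Tr(A) = 2.0 + 1.0 - 0.5 = 2.5.
A = [[2.0, -1.0, 0.5],
     [0.3, 1.0, 4.0],
     [-2.0, 0.7, -0.5]]
rng = random.Random(0)
print(hutchinson_trace(A, 1, rng))        # a single noisy sample (M = 1)
print(hutchinson_trace(A, 20000, rng))    # the average over many samples approaches Tr(A)
```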


In practice we can set M = 1 and just use a single sample, which is refreshed for
each new data point. Although this is a noisy estimate, this might not be too
significant since it forms part of a noisy stochastic gradient descent procedure.
Importantly, it is unbiased, meaning that the expectation of the estimator is equal to
the true value (Exercise 18.11).

Significant improvements in training efficiency for continuous normalizing flows
can be achieved using a technique called flow matching (Lipman et al., 2022). This
brings normalizing flows closer to diffusion models (Chapter 20) and avoids the need for
backpropagation through the integrator while significantly reducing memory
requirements and enabling faster inference and more stable training.
18.1 (⋆⋆) Consider a transformation x = f(z) along with its inverse z = g(x). By
differentiating x = f(g(x)), show that

JK = I    (18.31)

where I is the identity matrix, and J and K are matrices with elements

J_{ij} = \frac{\partial g_i}{\partial x_j}, \qquad K_{ij} = \frac{\partial f_i}{\partial z_j}.    (18.32)

Using the result that the determinant of a product of matrices is the product of their
determinants, show that

\det(J) = \frac{1}{\det(K)}.    (18.33)

Hence, show that the formula (18.1) for the transformation of a density under a
change of variables can be rewritten as

p_x(x) = p_z(g(x)) \, |\det K|^{-1}    (18.34)

where K is evaluated at z = g(x).


18.2 (⋆) Consider a sequence of invertible transformations of the form

x = f_1(f_2(\cdots f_{M-1}(f_M(z)) \cdots)).    (18.35)

Show that the inverse function is given by

z = f_M^{-1}(f_{M-1}^{-1}(\cdots f_2^{-1}(f_1^{-1}(x)) \cdots)).    (18.36)


18.3 (⋆) Consider a linear change of variables of the form

x = z + b.    (18.37)


Show that the Jacobian of this transformation is the identity matrix. Interpret this 
result by comparing the volume of a small region of z-space with the volume of the 
corresponding region of x-space. 


18.4 (⋆⋆) Show that the Jacobian of the autoregressive normalizing flow transformation
given by (18.18) is a lower triangular matrix. The determinant of such a matrix is 
given by the product of the terms on the leading diagonal and is therefore easily 
evaluated. 


18.5 (⋆) Consider the forward propagation equation for a residual network given by (18.21)
in which we consider a small increment ε in the ‘time’ variable t:

z^{(t+\epsilon)} = z^{(t)} + \epsilon f(z^{(t)}, w).    (18.38)

Here the additive contribution from the neural network is scaled by ε. Note that
(18.21) corresponds to the case ε = 1. By taking the limit ε → 0, derive the forward
propagation differential equation given by (18.22).
18.6 (⋆⋆) In this exercise and the next we provide an informal derivation of the
backpropagation and gradient evaluation equations for a neural ODE. A more formal derivation
of these results can be found in Chen et al. (2018). Write down the backpropagation
equation corresponding to the forward equation (18.38). By taking the limit ε → 0,
derive the backward propagation equation (18.25), where a(t) is defined by (18.24).


18.7 (⋆⋆) By making use of the result (8.10), write down an expression for the gradient
of a loss function L(z(T)) for a multilayered residual network defined by (18.38)
in which all layers share the same parameter vector w. By taking the limit ε → 0,
derive the equation (18.26) for the derivative of the loss function.


18.8 (⋆⋆⋆) In this exercise we give an informal derivation of (18.28) for one-dimensional
distributions. Consider a distribution q(z) at time t that is transformed to a new
distribution p(x) at time t + δt as a result of a transformation from z to x. Also
consider nearby values z and z + Δz along with corresponding values x and x +
Δx as shown in Figure 18.7. First, write down an equation that expresses that the
probability mass in the interval Δz is the same as that in the interval Δx. Second,
write down an equation that shows how the probability density changes in going
from t to t + δt, expressed in terms of the derivative dq(t)/dt. Third, write down an
equation for Δx in terms of Δz by introducing the function f(z) = dz/dt. Finally,
by combining these three equations and taking the limit δt → 0, show that

\frac{d}{dt} \ln q(z) = -f'(z),    (18.39)

which is the one-dimensional version of (18.28).

Figure 18.7 Schematic illustration of the transformation of probability densities used to
derive the equation for continuous normalizing flows in one dimension.


18.9 (⋆⋆) The flow lines in Figure 18.6 were plotted by taking a set of equally spaced
values and using the inverse of the cumulative distribution function at each value of t 
to plot the corresponding points in z-space. Show that this is equivalent to using the 
differential equation (18.27) to compute the flow lines where f is defined by (18.28). 


18.10 (⋆⋆) Using the differential equation (18.27) write down an expression for the base
density of a continuous normalizing flow in terms of the output density, expressed 
as an integral over t. Hence, by making use of the fact that changing the sign of a 
definite integral is equivalent to swapping the limits on that integral, show that the 
computational cost of inverting a continuous normalizing flow is the same as that 
needed to evaluate the forward flow. 
18.11 (⋆) Show that the expectation of the right-hand side in the Hutchinson trace estimator
(18.30) is equal to Tr(A) for any value of M. This shows that the estimator is 
unbiased. 


Autoencoders


A central goal of deep learning is to discover representations of data that are useful 
for one or more subsequent applications. One well-established approach to learn- 
ing internal representations is called the auto-associative neural network or autoen- 
coder. This consists of a neural network having the same number of output units as 
inputs and which is trained to generate an output y that is close to the input x. Once 
trained, an internal layer within the neural network gives a representation z(x) for 
each new input. Such a network can be viewed as having two parts. The first is an 
encoder, which maps the input x into a hidden representation z(x), and the second 
is a decoder, which maps the hidden representation onto the output y(z). 

If an autoencoder is to find non-trivial solutions, it is necessary to introduce 
some form of constraint, otherwise the network can simply copy the input values 
to the outputs. This constraint might be achieved, for example, by restricting the 
dimensionality of z relative to that of x or by requiring z to have a sparse 
representation. Alternatively, the network can be forced to discover non-trivial solutions by 
modifying the training process such that the network has to learn to undo corruptions 
to the input vectors such as additive noise or missing values. These kinds of 
constraint encourage the network to discover interesting structure within the data to 
achieve good training performance. 

In this chapter, we start with deterministic autoencoders and then later generalize 
to stochastic models that learn an encoder distribution p(z|x) together with a 
decoder distribution p(y|z). These probabilistic models are known as variational 
autoencoders and represent the third of our four approaches to learning nonlinear 
latent variable models. 


19.1 Deterministic Autoencoders 


We encountered a simple form of autoencoder when we studied principal compo- 
nent analysis (PCA). This is a model that makes a linear transformation of an input 
vector onto a lower dimensional manifold, and the resulting projection can be ap- 
proximately reconstructed back in the original data space, again through a linear 
transformation. We can make use of the nonlinearity of neural networks to define a 
form of nonlinear PCA in which the latent manifold is no longer a linear subspace 
of the data space. This is achieved by using a network having the same number of 
outputs as inputs and by optimizing the weights so as to minimize some measure of 
the reconstruction error between inputs and outputs with respect to a set of training 
data. 

Simple autoencoders are rarely used directly in modern deep learning, as they 
do not provide semantically meaningful representations in the latent space and they 
are not able directly to generate new examples from the data distribution. However, 
they provide an important conceptual foundation for some of the more powerful deep 
generative models such as variational autoencoders. 


19.1.1 Linear autoencoders 


Consider first a multilayer perceptron of the form shown in Figure 19.1, having 
D inputs, D output units, and M hidden units, with M < D. The targets used 
to train the network are simply the input vectors themselves, so that the network 
attempts to map each input vector onto itself. Such a network is said to form an auto- 
associative mapping. Since the number of hidden units is smaller than the number 
of inputs, a perfect reconstruction of all input vectors is not in general possible. We 
therefore determine the network parameters w by minimizing an error function that 
captures the degree of mismatch between the input vectors and their reconstructions. 
In particular, we choose a sum-of-squares error of the form 


$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{x}_n \right\|^2. \tag{19.1} $$


If the hidden units have linear activation functions, then it can be shown that the 
error function has a unique global minimum and that at this minimum the network 
performs a projection onto the M-dimensional subspace that is spanned by the first 
M principal components of the data (Bourlard and Kamp, 1988; Baldi and Hornik, 
1989). Thus, the vectors of weights that lead into the hidden units in Figure 19.1 
form a basis set that spans the principal subspace. Note, however, that these vectors 
need not be orthogonal or normalized. This result is unsurprising, since both PCA 
and neural networks rely on linear dimensionality reduction and minimize the same 
sum-of-squares error function. 

Figure 19.1 An autoencoder neural network having two layers of weights. Such a network 
is trained to map input vectors onto themselves by minimizing a sum-of-squares error. Even 
with nonlinear units in the hidden layer, such a network is equivalent to linear principal 
component analysis. Links representing bias parameters have been omitted for clarity. 
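The result for linear hidden units can be checked numerically on a small example. The following is a minimal sketch, not taken from the text, that assumes a toy data set in D = 2 dimensions, a single linear hidden unit (M = 1), no bias parameters, and plain batch gradient descent; the data values, starting weights, learning rate, and function names are all illustrative assumptions.

```python
# Toy zero-mean data in D = 2 dimensions lying close to the direction (1, 1).
# The values are illustrative assumptions, not data from the text.
ts = [-2.0, -1.0, 0.0, 1.0, 2.0]
ss = [0.1, -0.2, 0.2, -0.2, 0.1]
X = [(t + s, t - s) for t, s in zip(ts, ss)]


def error(a, b):
    """Sum-of-squares error (19.1) for hidden unit z = a.x and output y = b * z."""
    total = 0.0
    for x in X:
        z = a[0] * x[0] + a[1] * x[1]
        total += (b[0] * z - x[0]) ** 2 + (b[1] * z - x[1]) ** 2
    return 0.5 * total


def train(steps=5000, lr=0.01):
    """Batch gradient descent on (19.1) with M = 1 linear hidden unit and no biases."""
    a = [0.5, 0.1]  # first-layer weights (arbitrary starting point)
    b = [0.1, 0.5]  # second-layer weights (arbitrary starting point)
    for _ in range(steps):
        grad_a = [0.0, 0.0]
        grad_b = [0.0, 0.0]
        for x in X:
            z = a[0] * x[0] + a[1] * x[1]
            r = (b[0] * z - x[0], b[1] * z - x[1])  # reconstruction residual
            b_dot_r = b[0] * r[0] + b[1] * r[1]
            for i in range(2):
                grad_b[i] += r[i] * z          # dE/db_i
                grad_a[i] += b_dot_r * x[i]    # dE/da_i
        a = [a[i] - lr * grad_a[i] for i in range(2)]
        b = [b[i] - lr * grad_b[i] for i in range(2)]
    return a, b


a, b = train()
print("second-layer weights:", b)
print("final error:", error(a, b))
```

For this data set the first principal component lies along (1, 1), and the sum of squared components orthogonal to it is 0.28, so projection onto the principal subspace gives E = 0.14. The trained second-layer weights should therefore have nearly equal components, and the final error should be close to 0.14, although the weight vectors themselves need not have unit length.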

It might be thought that the limitations of a linear manifold could be overcome 
by using nonlinear activation functions for the hidden units in the network in Fig- 
ure 19.1. However, even with nonlinear hidden units, the minimum error solution 
is again given by the projection onto the principal component subspace (Bourlard 
and Kamp, 1988). There is therefore no advantage in using two-layer neural net- 
works to perform dimensionality reduction. Standard techniques for PCA, based on 
singular-value decomposition (SVD), are guaranteed to give the correct solution in 
finite time, and they also generate an ordered set of eigenvalues with corresponding 
orthonormal eigenvectors. 


19.1.2 Deep autoencoders 


The situation is different, however, if additional nonlinear layers are included in 
the network. Consider the four-layer auto-associative network shown in Figure 19.2. 
Again, the output units are linear, and the M units in the second layer can also 
be linear. However, the first and third layers have sigmoidal nonlinear activation 
functions. The network is again trained by minimizing the error function (19.1). We 
can view this network as two successive functional mappings F1 and F2, as indicated 
in Figure 19.2. The first mapping F1 projects the original D-dimensional data onto 
an M-dimensional subspace S defined by the activations of the units in the second 
layer. Because of the first layer of nonlinear units, this mapping is very general and 
is not restricted to being linear. Similarly, the second half of the network defines 
an arbitrary functional mapping from the M-dimensional hidden space back into the 
original D-dimensional input space. This has a simple geometrical interpretation, as 
indicated for D = 3 and M = 2 in Figure 19.3. 

Such a network effectively performs a nonlinear form of PCA. It has the advantage 
of not being limited to linear transformations, although it contains standard 
PCA as a special case. However, training the network now involves a nonlinear 
optimization, since the error function (19.1) is no longer a quadratic function of the 
network parameters. Computationally intensive nonlinear optimization techniques 
must be used, and there is the risk of finding a sub-optimal local minimum of the 
error function. Also, the dimensionality of the subspace must be specified before 
training the network. 

Figure 19.2 Adding extra hidden layers of nonlinear units produces an auto-associative 
network, which can perform a nonlinear dimensionality reduction. 


Figure 19.3 Geometrical interpretation of the mappings performed by the network in Figure 19.2 for a model 
with D = 3 inputs and M = 2 units in the second layer. The function F2 from the latent space defines the way 
in which the manifold S is embedded within the higher-dimensional data space. Since F2 can be nonlinear, the 
embedding of S can be non-planar, as indicated in the figure. The function F1 then defines a projection from the 
original D-dimensional data space into the M-dimensional latent space. 


19.1.3 Sparse autoencoders 


Instead of limiting the number of nodes in one of the hidden layers in the network, 
an alternative way to constrain the internal representation is to use a regularizer 
to encourage a sparse representation, leading to a lower effective dimensionality. 
A simple choice is the L1 regularizer (see Section 9.2.2) since this encourages sparseness, 
giving a regularized error function of the form 

$$ \widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_{k=1}^{K} |z_k| \tag{19.2} $$



where E(w) is the unregularized error, and the sum over k is taken over the acti- 
vation values of all the units in one of the hidden layers. Note that regularization 
is usually applied to the parameters of a network, whereas here it is being used on 
the unit activations. The derivatives required for gradient descent training can be 
evaluated using automatic differentiation, as usual. 
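As an illustration of (19.2), the following minimal sketch evaluates the regularized error for a single data point. The function name `sparse_error` and the numerical values are hypothetical, and the unregularized error is assumed to be the sum-of-squares error (19.1) restricted to one input vector.

```python
def sparse_error(outputs, inputs, activations, lam):
    """Regularized error (19.2) for one data point.

    outputs, inputs: reconstruction y and input x (equal-length lists)
    activations:     values z_k of the units in the chosen hidden layer
    lam:             regularization coefficient lambda
    """
    # Unregularized sum-of-squares error, as in (19.1), for a single input.
    unregularized = 0.5 * sum((y - x) ** 2 for y, x in zip(outputs, inputs))
    # L1 penalty applied to the unit activations, not to the weights.
    penalty = lam * sum(abs(z) for z in activations)
    return unregularized + penalty


value = sparse_error([1.0, 2.0], [1.0, 0.0], [0.5, -0.5, 0.0], lam=0.1)
print(value)
```

Because the penalty depends on the activations, its gradient flows back only through the layers that precede the regularized hidden layer, which automatic differentiation handles without any special treatment.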


19.1.4 Denoising autoencoders 


We have seen the importance of constraining the dimensionality of the latent 
space layer in a simple autoencoder to avoid the model simply learning the identity 
mapping. An alternative approach, that also forces the model to discover interesting 
internal structure in the data, is to use a denoising autoencoder (Vincent et al., 2008). 
The idea is to take each input vector x_n and to corrupt it with noise to give a modified 
vector x̃_n, which is then input to an autoencoder to give an output y(x̃_n, w). The 
network is trained to reconstruct the original noise-free input vector by minimizing 
an error function such as the sum-of-squares given by 

$$ E(\mathbf{w}) = \sum_{n=1}^{N} \left\| \mathbf{y}(\widetilde{\mathbf{x}}_n, \mathbf{w}) - \mathbf{x}_n \right\|^2. \tag{19.3} $$


One form of noise involves setting a randomly chosen subset of the input variables 
to zero. The fraction ν of such inputs represents the noise level, and lies in the range 
0 < ν < 1. An alternative approach is to add independent zero-mean Gaussian 
noise to every input variable, where the scale of the noise is set by the variance of 
the Gaussian. By learning to denoise the input data, the network is forced to learn 
aspects of the structure of that data. For example, if the data comprises images, 
then learning that nearby pixel values are strongly correlated allows noise-corrupted 
pixels to be corrected. 
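The two forms of corruption described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and we assume that exactly round(νD) of the D input variables are set to zero, which is one of several reasonable ways to realize a noise level ν.

```python
import random


def mask_corrupt(x, nu, rng):
    """Set a randomly chosen subset of the inputs to zero (noise level nu)."""
    d = len(x)
    num_zeroed = round(nu * d)  # assumption: an exact fraction is zeroed
    zeroed = set(rng.sample(range(d), num_zeroed))
    return [0.0 if i in zeroed else xi for i, xi in enumerate(x)]


def gaussian_corrupt(x, variance, rng):
    """Add independent zero-mean Gaussian noise to every input variable."""
    std = variance ** 0.5
    return [xi + rng.gauss(0.0, std) for xi in x]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
x_masked = mask_corrupt(x, 0.5, rng)
x_noisy = gaussian_corrupt(x, 0.01, rng)
print(x_masked)
print(x_noisy)
```

In training, the corrupted vector is passed through the network while the original vector x is used as the target in (19.3), and a fresh corruption would typically be drawn each time a data point is presented.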

More formally, the training of denoising autoencoders is related to score matching 
(Vincent, 2011) where the score is defined by s(x) = ∇_x ln p(x). Some intuition 
for this relationship is given in Figure 19.4. The autoencoder learns to reverse the 
distortion vector x̃_n − x_n and therefore learns a vector for each point in data space 
that points towards the manifold and therefore towards the region of high data density. 
The score vector ∇_x ln p(x) is similarly a vector pointing towards the region of 
high data density. We will explore the relationship between score matching and 
denoising in more depth when we discuss diffusion models which also learn to remove 
noise from noise-corrupted inputs. 
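As a simple illustration of why the score points towards regions of high density (an example we add here under the assumption of an isotropic Gaussian density with mean μ and variance σ², which is not the data distribution considered in the text), the score can be evaluated in closed form:

```latex
p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \,|\, \boldsymbol{\mu}, \sigma^2 \mathbf{I})
\quad \Longrightarrow \quad
\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \ln p(\mathbf{x})
= -\frac{\mathbf{x} - \boldsymbol{\mu}}{\sigma^2}.
```

At every point x the score is directed from x towards the mean μ, where the density is highest, and its magnitude grows with the distance from the mean.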


Figure 19.4 In a denoising autoencoder, data points, which are assumed to live on a 
lower-dimensional manifold in data space, are corrupted with additive noise. The autoencoder 
learns to map corrupted data points back to their original values and therefore learns a 
vector for each point in data space that points towards the manifold. 


19.1.5 Masked autoencoders 


We have seen that transformer models such as BERT can learn rich internal 
representations of natural languages through self-supervision by masking random 
subsets of the inputs, and it is natural to ask if a similar approach can be applied to 
natural images. In a masked autoencoder (He et al., 2021), a deep network is used 
to reconstruct an image given a corrupted version of that image as input, similar to 
denoising autoencoders. However in this case, the form of corruption is masking, or 
dropping out, part of the input image. This technique is generally used in combination 
with a vision transformer architecture, as in this case, masking part of the input 
can be easily implemented by passing only a subset of randomly selected input patch 
tokens to the encoder. The overall algorithm is summarized in Figure 19.5. 

Compared to language, images have much more redundancy along with strong 
local correlations. Omitting a single word from a sentence can greatly increase am- 
biguity whereas removing a random patch from an image typically has little impact 
on the semantics of the image. Unsurprisingly, the best internal representations are 
learned when a relatively high proportion of the input image is masked, typically 
75% compared with the 15% masking for BERT. In BERT the masked inputs are 
replaced by a fixed mask token, whereas in the masked autoencoder the masked 
patches are simply omitted. By omitting a large fraction of the input patches, we can 
save significant computation, particularly as the computation required for a training 
instance of a transformer scales poorly with input sequence length, thus making the 
masked autoencoder a good choice for pre-training large transformer encoders. 
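The selection of visible patches can be sketched in a few lines. In this minimal illustration the function name `split_patches` is hypothetical, and the figure of 196 patches is an assumption corresponding to a 14 × 14 grid of patches; only the indices of the patches are manipulated, not the image data itself.

```python
import random


def split_patches(num_patches, mask_ratio, rng):
    """Randomly partition patch indices into visible and masked subsets."""
    num_masked = int(round(mask_ratio * num_patches))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])  # only these tokens reach the encoder
    return visible, masked


rng = random.Random(0)
visible, masked = split_patches(196, 0.75, rng)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

With 75% masking the encoder processes only a quarter of the patch tokens, which is the source of the computational saving described above.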

As the decoder layer is also a transformer, it needs to work in the dimensionality 
of the original image. Since the output of a transformer has the same dimensionality 
as the input, we need to restore the image dimensionality between the output of the 
encoder and the input of the decoder. This is achieved by reinstating the masked 
patches, represented by a fixed mask token vector, with each patch token augmented 
by positional encoding information. Due to the much higher dimensionality of the 
decoder representation, the decoder transformer has far fewer learnable parameters 
than the encoder. The output of the decoder is followed by a learnable linear layer 
that maps the output representation into the space of pixel values, and the training 
error function is simply the mean squared error averaged over the missing patches 
for each image. Examples of images reconstructed by a trained masked autoencoder 
are shown in Figure 19.6 and demonstrate the ability of a trained autoencoder to 
generate semantically plausible reconstructions. However, the ultimate goal is to 
learn useful internal representations for subsequent downstream tasks, for which the 
decoder is discarded and the encoder is applied to the full image with no masking 
and with a fresh set of output layers that are fine-tuned for the required application. 
Note also that although this algorithm was initially designed for image data, it can in 
theory be applied to any modality. 

Figure 19.5 Architecture of a masked autoencoder during the training phase. Note that the target is the 
complement of the input as the loss is only applied on masked patches. After training, the decoder is discarded and 
the encoder is used to map images to an internal representation for use in downstream tasks. 


19.2 Variational Autoencoders 


We have already seen that the likelihood function for a latent-variable model given 
by 


$$ p(\mathbf{x}|\mathbf{w}) = \int p(\mathbf{x}|\mathbf{z}, \mathbf{w})\, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}, \tag{19.4} $$


in which p(x|z, w) is defined by a deep neural network, is intractable because the 
integral over z cannot be evaluated analytically. The variational autoencoder, or 
VAE (Kingma and Welling, 2013; Rezende, Mohamed, and Wierstra, 2014; Doersch, 
2016; Kingma and Welling, 2019) instead works with an approximation to this 
likelihood when training the model. There are three key ideas in the VAE: (i) use of 
the evidence lower bound (ELBO) to approximate the likelihood function, leading to 
a close relationship to the EM algorithm, (ii) amortized inference in which a second 
model, the encoder network, is used to approximate the posterior distributions over 
latent variables in the E step, rather than evaluating the posterior distribution for each 
data point exactly, and (iii) making the training of the encoder model tractable using 
the reparameterization trick. 

Figure 19.6 Four examples of images reconstructed using a trained masked autoencoder, in which 80% of 
the input patches are masked. In each case the masked image is on the left, the reconstructed image is in the 
centre, and the original image is on the right. [From He et al. (2021) with permission.] 

Consider a generative model with a conditional distribution p(x|z, w) over the 
D-dimensional data variable x governed by the output of a deep neural network 
g(z, w). For example, g(z, w) might represent the mean of a Gaussian conditional 
distribution. Also, consider a distribution over the M-dimensional latent variable z 
that is given by a zero-mean unit-variance Gaussian: 

$$ p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \,|\, \mathbf{0}, \mathbf{I}). \tag{19.5} $$


To derive the VAE approximation, first recall that, for an arbitrary probability 
distribution q(z) over a space described by the latent variable z, the following relationship 
holds: 

$$ \ln p(\mathbf{x}|\mathbf{w}) = \mathcal{L}(\mathbf{w}) + \mathrm{KL}\left(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x}, \mathbf{w})\right) \tag{19.6} $$

where ℒ is the evidence lower bound, or ELBO, also known as the variational lower 
bound, given by 

$$ \mathcal{L}(\mathbf{w}) = \int q(\mathbf{z}) \ln \left\{ \frac{p(\mathbf{x}|\mathbf{z}, \mathbf{w})\, p(\mathbf{z})}{q(\mathbf{z})} \right\} \mathrm{d}\mathbf{z} \tag{19.7} $$

and the Kullback–Leibler divergence KL(·‖·) is defined by 

$$ \mathrm{KL}\left(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x}, \mathbf{w})\right) = -\int q(\mathbf{z}) \ln \left\{ \frac{p(\mathbf{z}|\mathbf{x}, \mathbf{w})}{q(\mathbf{z})} \right\} \mathrm{d}\mathbf{z}. \tag{19.8} $$

Because the Kullback–Leibler divergence satisfies KL(q‖p) ≥ 0, it follows that 

$$ \ln p(\mathbf{x}|\mathbf{w}) \geqslant \mathcal{L} \tag{19.9} $$

and so ℒ is a lower bound on ln p(x|w). Although the log likelihood ln p(x|w) is 
intractable, we will see how the lower bound can be evaluated using a Monte Carlo 
estimate. Hence it provides an approximation to the true log likelihood. 
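The decomposition (19.6) can be verified numerically in a setting where everything is tractable. The following sketch assumes, purely for illustration, a binary latent variable so that the integrals become sums over two states; the probability values are arbitrary choices and do not come from the text.

```python
import math

# Illustrative discrete model: z takes values in {0, 1}.
prior = [0.5, 0.5]        # p(z)
likelihood = [0.9, 0.2]   # p(x|z) for one fixed observation x
q = [0.6, 0.4]            # an arbitrary distribution q(z)

# Exact evidence p(x) and posterior p(z|x) from Bayes' theorem.
evidence = sum(pz * px for pz, px in zip(prior, likelihood))
posterior = [pz * px / evidence for pz, px in zip(prior, likelihood)]

# ELBO (19.7) and Kullback-Leibler divergence (19.8), with sums replacing integrals.
elbo = sum(qz * math.log(px * pz / qz) for qz, px, pz in zip(q, likelihood, prior))
kl = -sum(qz * math.log(post / qz) for qz, post in zip(q, posterior))

print("ln p(x)   =", math.log(evidence))
print("ELBO + KL =", elbo + kl)
```

The two printed values agree, and since the Kullback–Leibler term is non-negative the ELBO can never exceed ln p(x). Setting q equal to the exact posterior makes the divergence vanish and the bound tight.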

Now consider a set of training data points D = {x_1, ..., x_N}, which are assumed 
to be drawn independently from the model distribution p(x). The log likelihood 
function for this data set is given by 

$$ \ln p(\mathcal{D}|\mathbf{w}) = \sum_{n=1}^{N} \mathcal{L}_n + \sum_{n=1}^{N} \mathrm{KL}\left(q_n(\mathbf{z}_n) \,\|\, p(\mathbf{z}_n|\mathbf{x}_n, \mathbf{w})\right) \tag{19.10} $$

where 

$$ \mathcal{L}_n = \int q_n(\mathbf{z}_n) \ln \left\{ \frac{p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w})\, p(\mathbf{z}_n)}{q_n(\mathbf{z}_n)} \right\} \mathrm{d}\mathbf{z}_n. \tag{19.11} $$

Note that this introduces a separate latent variable z_n corresponding to each data 
vector x_n, as we saw with mixture models and with the probabilistic PCA model. 
Consequently, each latent variable has its own independent distribution q_n(z_n), each 
of which can be optimized separately. 

Since (19.10) holds for any choice of the distributions q_n(z_n), we can choose the 
distributions that maximize the bound ℒ_n, or equivalently the distributions that 
minimize the Kullback–Leibler divergences KL(q_n(z_n)‖p(z_n|x_n, w)). For the simple 
Gaussian mixture and probabilistic PCA models considered previously, we were able 
to evaluate these posterior distributions exactly in the E step of the EM algorithm, 
which corresponds to setting each q_n(z_n) equal to the corresponding posterior 
distribution p(z_n|x_n, w). This gives zero Kullback–Leibler divergence, and hence the 
lower bound is equal to the true log likelihood. The interpretation of the posterior 
distribution is illustrated in Figure 19.7 using the simple example introduced earlier 
in the context of generative adversarial networks. 

The exact posterior distribution of z_n is given from Bayes’ theorem by 

$$ p(\mathbf{z}_n|\mathbf{x}_n, \mathbf{w}) = \frac{p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w})\, p(\mathbf{z}_n)}{p(\mathbf{x}_n|\mathbf{w})}. \tag{19.12} $$


The numerator is straightforward to evaluate for our deep generative model. However, 
we see that the denominator is given by the likelihood function, which as we 
have already noted, is intractable. We therefore need to find an approximation to 
the posterior distribution. In principle, we could consider a separate parameterized 
model for each of the distributions q_n(z_n) and optimize each model numerically, 
but this would be computationally very expensive, especially for large data sets, and 
moreover we would have to re-evaluate the distributions after every update of w. 
Instead, we turn now to a different, more efficient approximation framework based 
on the introduction of a second neural network. 

Figure 19.7 Evaluation of the posterior distribution for the same model as shown in Figure 16.13. The marginal 
distribution p(x), shown in the right-most plot in (b), has a banana shape, and the specific data point x* is closer 
to the horns of the shape than to the middle. Consequently the posterior distribution p(z|x*), shown in (a), is 
bimodal, even though the prior distribution p(z) is unimodal. [Based on Prince (2020) with permission.] 


19.2.1 Amortized inference 


In the variational autoencoder, instead of trying to evaluate a separate posterior 
distribution p(z_n|x_n, w) for each of the data points x_n individually, we train a single 
neural network, called the encoder network, to approximate all these distributions. 
This technique is called amortized inference and requires an encoder that produces 
a single distribution q(z|x, φ) that is conditioned on x, where φ represents the 
parameters of the network. The objective function, given by the evidence lower bound, 
now has a dependence on φ as well as on w, and we use gradient-based optimization 
methods to maximize the bound jointly with respect to both sets of parameters. 

A VAE therefore comprises two neural networks that have independent param- 
eters but which are trained jointly: an encoder network that takes a data vector and 
maps it to a latent space, and the original network that takes a latent space vector and 
maps it back to the data space and which we can therefore interpret as a decoder network. 
This is like the simple neural network autoencoder model, except that we now 
define a probability distribution over the latent space. We will see that the encoder 
calculates an approximate probabilistic inverse of the decoder according to Bayes’ 
theorem. 

Figure 19.8 Illustration of the optimization of the ELBO (evidence lower bound). (a) For a given value w_0 of 
the decoder network parameters w, we can increase the bound by optimizing the parameters φ of the encoder 
network. (b) For a given value of φ, we can increase the value of the ELBO function by optimizing w. Note that 
the ELBO function, shown by the blue curves, always lies somewhat below the log likelihood function, shown in 
red, because the encoder network is generally not able to match the true posterior distribution exactly. 

A typical choice for the encoder is a Gaussian distribution with a diagonal covariance 
matrix whose mean and variance parameters, μ_j and σ_j², are given by the 
outputs of a neural network that takes x as input: 

$$ q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) = \prod_{j=1}^{M} \mathcal{N}\left(z_j \,|\, \mu_j(\mathbf{x}, \boldsymbol{\phi}), \sigma_j^2(\mathbf{x}, \boldsymbol{\phi})\right). \tag{19.13} $$

Note that the means μ_j(x, φ) lie in the range (−∞, ∞), and so the corresponding 
output-unit activation functions can be linear, whereas the variances σ_j²(x, φ) must 
be non-negative and so the associated output units typically use exp(·) as their 
activation function. 
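The choice of output-unit activation functions can be illustrated with a short sketch. Here `encoder_head` is a hypothetical helper that takes the pre-activations of the final encoder layer, which we assume have already been computed by the rest of the network, and returns the parameters of the distribution (19.13).

```python
import math


def encoder_head(pre_mu, pre_var):
    """Map final-layer pre-activations to the means and variances in (19.13)."""
    mu = list(pre_mu)                        # linear activation: any real value
    var = [math.exp(a) for a in pre_var]     # exp(.) guarantees a positive variance
    return mu, var


mu, var = encoder_head([-3.0, 2.0], [0.0, -50.0])
print(mu, var)
```

With this parameterization the second set of pre-activations can be interpreted as log variances, so that even a large negative pre-activation gives a small but strictly positive variance.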

The goal is to use gradient-based optimization to maximize the bound with respect 
to both sets of parameters φ and w, typically by using stochastic gradient 
descent based on mini-batches. Although we optimize the parameters jointly, 
conceptually we could imagine alternating between optimizing φ and optimizing w, in 
the spirit of the EM algorithm, as illustrated in Figure 19.8. 

A key difference compared to EM is that, for a given value of w, optimizing with 
respect to the parameters φ of the encoder does not in general reduce the 
Kullback–Leibler divergence to zero, because the encoder network is not a perfect predictor 
of the posterior latent distribution and so there is a residual gap between the lower 
bound and the true log likelihood. Although the encoder is very flexible, since it 
is based on a deep neural network, it is not expected to model the true posterior 
distribution exactly because (i) the true conditional posterior distribution will not be 
a factorized Gaussian, (ii) even a large neural network has limited flexibility, and (iii) 
the training process is only an approximate optimization. The relation between the 
EM algorithm and ELBO optimization is summarized in Figure 19.9. 

Figure 19.9 Comparison of the EM algorithm with ELBO optimization in a VAE. (a) In the EM algorithm we 
alternate between updating the variational posterior distribution in the E step, and the model parameters in the 
M step. When the E step is exact, the gap between the lower bound and the log likelihood is reduced to zero 
after each E step. (b) In the VAE we perform joint optimization of the encoder network parameters φ (analogous 
to the E step) and the decoder network parameters w (analogous to the M step). 


19.2.2 The reparameterization trick 


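Unfortunately, as it stands, the lower bound (19.11) is still intractable to compute 
because it involves integrals over the latent variables {z_n} in which the integrand has 
a complicated dependence on the latent variables because of the decoder network. 
For data point x_n we can write the contribution to the lower bound in the form 

$$ \begin{aligned} \mathcal{L}_n(\mathbf{w}, \boldsymbol{\phi}) &= \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln \left\{ \frac{p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w})\, p(\mathbf{z}_n)}{q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi})} \right\} \mathrm{d}\mathbf{z}_n \\ &= \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w}) \, \mathrm{d}\mathbf{z}_n - \mathrm{KL}\left(q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}_n)\right). \end{aligned} \tag{19.14} $$

The second term on the right-hand side is a Kullback–Leibler divergence between 
two Gaussian distributions and can be evaluated analytically: 

$$ \mathrm{KL}\left(q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}_n)\right) = -\frac{1}{2} \sum_{j=1}^{M} \left\{ 1 + \ln \sigma_j^2(\mathbf{x}_n) - \mu_j^2(\mathbf{x}_n) - \sigma_j^2(\mathbf{x}_n) \right\}. \tag{19.15} $$

Figure 19.10 When the ELBO is estimated by fixing the latent variable z to a sampled value this blocks 
backpropagation of the error signal to the encoder network. 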


For the first term in (19.14), we could try to approximate the integral over z_n with a 
simple Monte Carlo estimator: 

$$ \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w}) \, \mathrm{d}\mathbf{z}_n \simeq \frac{1}{L} \sum_{l=1}^{L} \ln p(\mathbf{x}_n|\mathbf{z}_n^{(l)}, \mathbf{w}) \tag{19.16} $$

where {z_n^(l)} are samples drawn from the encoder distribution q(z_n|x_n, φ). This is 
easily differentiated with respect to w, but the gradient with respect to φ is problematic 
because changes to φ will change the distribution q(z_n|x_n, φ) from which the 
samples are drawn and yet these samples are fixed values so that we do not have a 
way to obtain the derivatives of these samples with respect to φ. Conceptually, we 
can think of the process of fixing z_n to a specific sample value as blocking the 
backpropagation of the error signal to the encoder network, as illustrated in Figure 19.10. 
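In contrast to the first term, the Kullback–Leibler term in (19.14) needs no sampling. The following sketch compares its closed form, the negative of one half of the sum over {1 + ln σ_j² − μ_j² − σ_j²}, with a Monte Carlo estimate of the same quantity. The function names, the parameter values, and the number of samples are illustrative assumptions.

```python
import math
import random


def kl_to_standard_normal(mu, var):
    """Closed-form KL(q || p) for diagonal Gaussian q and p = N(0, I)."""
    return -0.5 * sum(1.0 + math.log(v) - m * m - v for m, v in zip(mu, var))


def kl_monte_carlo(mu, var, num_samples, rng):
    """Monte Carlo estimate of KL(q || p) = E_q[ln q(z) - ln p(z)]."""
    total = 0.0
    for _ in range(num_samples):
        for m, v in zip(mu, var):
            z = m + math.sqrt(v) * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2.0 * math.pi * v) + (z - m) ** 2 / v)
            log_p = -0.5 * (math.log(2.0 * math.pi) + z * z)
            total += log_q - log_p
    return total / num_samples


mu, var = [1.0, -0.5], [0.5, 2.0]
exact = kl_to_standard_normal(mu, var)
estimate = kl_monte_carlo(mu, var, 20000, random.Random(0))
print("closed form:", exact)
print("Monte Carlo:", estimate)
```

The sampled estimate fluctuates around the closed-form value, which is one reason for evaluating this term analytically and reserving sampling for the reconstruction term.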

We can resolve this by making use of the reparameterization trick in which we 
reformulate the Monte Carlo sampling procedure such that derivatives with respect 
to φ can be calculated explicitly. First, note that if ε is a Gaussian random variable 
with zero mean and unit variance, then the quantity 

$$ z = \sigma \epsilon + \mu \tag{19.17} $$

will have a Gaussian distribution, with mean μ and variance σ². We now apply this 
to the samples in (19.16) in which μ and σ² are defined by the outputs μ_j(x_n, φ) 
and σ_j²(x_n, φ) of the encoder network, which represent the means and variances in 
distribution (19.13). Instead of drawing samples of z_n directly, we draw samples for 
ε and use (19.17) to evaluate corresponding samples for z_n: 

$$ z_{nj}^{(l)} = \sigma_j(\mathbf{x}_n, \boldsymbol{\phi})\, \epsilon_{nj}^{(l)} + \mu_j(\mathbf{x}_n, \boldsymbol{\phi}) \tag{19.18} $$

where l = 1, ..., L indexes the samples and σ_j is the square root of the variance σ_j². 
This makes the dependence on φ explicit 
and allows gradients with respect to φ to be evaluated, as illustrated in Figure 19.11. 
The reparameterization trick can be extended to other distributions but is limited to 
continuous variables. There are techniques to evaluate gradients directly without the 
reparameterization trick (Williams, 1992), but these estimators have high variance, 
and so reparameterization can also be viewed as a variance reduction technique. 
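To see how (19.17) exposes the dependence on the distribution parameters, consider the following sketch. It assumes a scalar latent variable and the illustrative function G(z) = z², for which the expectation μ² + σ² has known derivatives 2μ and 2σ. The chain rule is applied by hand here, whereas in practice automatic differentiation would be used.

```python
import random


def reparameterize(mu, sigma, eps):
    """Sample z = sigma * eps + mu as in (19.17)."""
    return sigma * eps + mu


rng = random.Random(0)
mu, sigma = 1.5, 0.5        # illustrative parameter values
num_samples = 20000

grad_mu = 0.0
grad_sigma = 0.0
for _ in range(num_samples):
    eps = rng.gauss(0.0, 1.0)           # noise drawn independently of mu and sigma
    z = reparameterize(mu, sigma, eps)
    # G(z) = z**2, so dG/dz = 2z, with dz/dmu = 1 and dz/dsigma = eps.
    grad_mu += 2.0 * z
    grad_sigma += 2.0 * z * eps
grad_mu /= num_samples
grad_sigma /= num_samples

print("estimated dE[G]/dmu    =", grad_mu, " (exact value", 2.0 * mu, ")")
print("estimated dE[G]/dsigma =", grad_sigma, " (exact value", 2.0 * sigma, ")")
```

Because the randomness enters only through ε, each sample z is a differentiable function of μ and σ, and averaging the per-sample derivatives gives an unbiased estimate of the gradient of the expectation.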

The full error function for the VAE, using our specific modelling assumptions, 
therefore becomes 

$$ \mathcal{L} = \sum_{n} \left\{ \frac{1}{2} \sum_{j=1}^{M} \left\{ 1 + \ln \sigma_{nj}^2 - \mu_{nj}^2 - \sigma_{nj}^2 \right\} + \frac{1}{L} \sum_{l=1}^{L} \ln p(\mathbf{x}_n|\mathbf{z}_n^{(l)}, \mathbf{w}) \right\} \tag{19.19} $$

where z_n^(l) has components z_nj^(l) = σ_nj ε_nj^(l) + μ_nj, in which μ_nj = μ_j(x_n, φ) and 
σ_nj = σ_j(x_n, φ), and the summation over n in (19.19) is over the data points in a 
mini-batch. The number of samples L, for each data point x_n, is typically set to 1, so 
that only a single sample is used. Although this gives a noisy estimate of the bound, 
it forms part of the stochastic gradient optimization step, which is already noisy, and 
overall leads to more efficient optimization. 

Figure 19.11 The reparameterization trick replaces a direct sample of z by one that is calculated from a 
sample of an independent random variable ε, thereby allowing the error signal to be backpropagated 
to the encoder network. The resulting model can be trained using gradient-based 
optimization to learn the parameters of both the encoder and decoder networks. 

We can summarize VAE training as follows. For each data point in a mini-batch, 
forward propagate through the encoder network to evaluate the means and 
variances of the approximate latent distribution, sample from this distribution using 
the reparameterization trick, and then propagate these samples through the decoder 
network to evaluate the ELBO (19.19). The gradients with respect to w and φ are 
then evaluated using automatic differentiation. VAE training is summarized in 
Algorithm 19.1, where, for clarity, we have omitted that this would generally be done 
using mini-batches. Once the model is trained, the encoder network is discarded and 
new data points are generated by sampling from the prior p(z) and forward 
propagating through the decoder network to obtain samples in the data space. 
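The forward computation for a single data point with L = 1 can be sketched as follows. This is an illustration under stated assumptions rather than a full training loop: the encoder outputs are supplied directly as numbers, `toy_decoder` is a hypothetical stand-in for the decoder network g(z, w), the conditional distribution p(x|z, w) is assumed to be a Gaussian with unit variance whose mean is given by the decoder, and no gradients are computed.

```python
import math


def toy_decoder(z):
    """Hypothetical stand-in for g(z, w): maps M = 1 latent to D = 2 outputs."""
    return [2.0 * z[0], -z[0]]


def elbo_one_sample(x, mu, var, eps, decoder):
    """Single-sample estimate of the contribution of x to the ELBO (19.19).

    mu, var: encoder outputs (means and variances) for this x
    eps:     noise drawn from N(0, I), one value per latent dimension
    """
    # Analytic term: the negative Kullback-Leibler divergence to the prior.
    kl_term = 0.5 * sum(1.0 + math.log(v) - m * m - v for m, v in zip(mu, var))
    # Reparameterized sample z_j = sigma_j * eps_j + mu_j.
    z = [math.sqrt(v) * e + m for m, v, e in zip(mu, var, eps)]
    # Reconstruction term ln p(x|z, w), assuming a unit-variance Gaussian.
    mean = decoder(z)
    log_lik = -0.5 * sum(
        math.log(2.0 * math.pi) + (xi - gi) ** 2 for xi, gi in zip(x, mean)
    )
    return kl_term + log_lik


value = elbo_one_sample([2.0, -1.0], [1.0], [1.0], [0.0], toy_decoder)
print(value)
```

In a practical implementation the same computation would be expressed in a framework with automatic differentiation, so that the gradients with respect to both w and φ follow from this forward pass.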

After training we might want to assess how well the model represents a new test 
point x̂. Since the log likelihood is intractable, we can use the lower bound ℒ as an 
approximation. To estimate this we can sample from q(z|x̂, φ) as this gives more 
accurate estimates than sampling from p(z). 

There are many variants of VAEs. When applied to image data, the encoder is 
typically based on convolutions and the decoder based on transpose convolutions. 
In a conditional VAE both the encoder and decoder take a conditioning variable c 
as an additional input. For example, we might want to generate images of objects, 
in which c represents the object class. The latent-space prior distribution p(z) can 
again be a simple Gaussian, or it can be extended to a conditional distribution p(z|c) 
given by another neural network. Training and testing proceed as before. 

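Algorithm 19.1: Variational autoencoder training

Input: Training data set $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
       Encoder network $\{\mu_j(\mathbf{x}_n, \boldsymbol{\phi}), \sigma_j^2(\mathbf{x}_n, \boldsymbol{\phi})\}$, $j \in \{1, \ldots, M\}$
       Decoder network $\mathbf{g}(\mathbf{z}, \mathbf{w})$
       Initial weight vectors $\mathbf{w}, \boldsymbol{\phi}$
       Learning rate $\eta$
Output: Final weight vectors $\mathbf{w}, \boldsymbol{\phi}$

repeat
    $\mathcal{L} \leftarrow 0$
    for $j \in \{1, \ldots, M\}$ do
        $\epsilon_{nj} \sim \mathcal{N}(0, 1)$
        $z_{nj} \leftarrow \mu_j(\mathbf{x}_n, \boldsymbol{\phi}) + \sigma_j(\mathbf{x}_n, \boldsymbol{\phi})\,\epsilon_{nj}$
        $\mathcal{L} \leftarrow \mathcal{L} + \frac{1}{2}\{1 + \ln \sigma_{nj}^2 - \mu_{nj}^2 - \sigma_{nj}^2\}$
    end for
    $\mathcal{L} \leftarrow \mathcal{L} + \ln p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w})$
    $\mathbf{w} \leftarrow \mathbf{w} + \eta \nabla_{\mathbf{w}} \mathcal{L}$    // Update decoder weights
    $\boldsymbol{\phi} \leftarrow \boldsymbol{\phi} + \eta \nabla_{\boldsymbol{\phi}} \mathcal{L}$    // Update encoder weights
until converged
return $\mathbf{w}, \boldsymbol{\phi}$


Note that the first term in the ELBO (19.14) encourages the encoder distribution
$q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ to be close to the prior p(z), and so the decoder model is encouraged to
produce realistic outputs when the trained model is run generatively by sampling
from p(z). When training VAEs, a problem can arise in which the variational
distribution $q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi})$ converges to the prior distribution p(z) and therefore becomes
uninformative because it no longer depends on x. In effect the latent code is
ignored. This is known as posterior collapse. A symptom of this is that if we take
an input and encode it and then decode it, we get a poor reconstruction that looks
blurry. In this case the Kullback–Leibler divergence $\mathrm{KL}(q(\mathbf{z}|\mathbf{x}, \boldsymbol{\phi}) \,\|\, p(\mathbf{z}))$ is close to
zero.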
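
As a rough illustration of this symptom, the following sketch evaluates the Kullback–Leibler divergence between a diagonal Gaussian encoder distribution and a standard Gaussian prior. The function name and the numerical values are hypothetical, chosen only to show that the divergence vanishes when the encoder output coincides with the prior.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian encoder."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

# Encoder output equal to the prior: the latent code carries no information about x.
collapsed = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# Encoder output that depends on x (illustrative values only).
informative = kl_to_standard_normal([1.5, -0.7], [0.3, 0.5])

print(collapsed, informative)
```

Monitoring this quantity during training is one simple way to notice that the divergence is approaching zero.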

A different problem occurs when the latent code is not compressed, which is 
characterized by highly accurate reconstructions, but such that outputs generated by 
sampling p(z) and passing the samples through the decoder network have poor qual- 
ity and do not resemble the training data. In this case the Kullback—Leibler diver- 
gence is relatively large, and because the trained system has a variational distribution 
that is very different from the prior, samples from the prior do not generate realistic 
outputs. 

Both problems can be addressed by introducing a coefficient $\beta$ in front of the first
term in (19.14) to control the regularization effectiveness of the Kullback–Leibler
divergence, where typically $\beta > 1$ (Higgins et al., 2017). Because a larger value of
$\beta$ penalizes the divergence more strongly, $\beta$ can be decreased if the reconstructions
look poor, whereas it can be increased if the samples look poor. The value of $\beta$ can
also be set to follow an annealing schedule in which it starts with a small value and
is gradually increased during training.
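
One possible annealing schedule can be sketched as follows. The linear warm-up and the names `kl_weight`, `warmup_steps`, and `beta_max` are assumptions made for illustration; the text only requires that the coefficient starts small and grows during training.

```python
def kl_weight(step, warmup_steps, beta_max):
    """Linearly ramp the KL coefficient from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def weighted_elbo(log_likelihood, kl, beta):
    """ELBO with a coefficient beta in front of the Kullback-Leibler term."""
    return log_likelihood - beta * kl

# Coefficient at the start, halfway through, and after the warm-up period.
schedule = [kl_weight(s, warmup_steps=100, beta_max=4.0) for s in (0, 50, 500)]
print(schedule)
```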

Finally, note that we have considered a decoder network $\mathbf{g}(\mathbf{z}, \mathbf{w})$ that represents
the mean of a Gaussian output distribution. We can extend the VAE to include
outputs representing the variance of the Gaussian or, more generally, the parameters that
characterize other more complex distributions.


Exercises

19.1 (⋆⋆) Show that, for any distribution $q(\mathbf{z}|\boldsymbol{\phi})$ and any function $G(\mathbf{z})$, the following
relation holds:

$$\nabla_{\boldsymbol{\phi}} \int q(\mathbf{z}|\boldsymbol{\phi})\, G(\mathbf{z}) \,\mathrm{d}\mathbf{z} = \int q(\mathbf{z}|\boldsymbol{\phi})\, G(\mathbf{z})\, \nabla_{\boldsymbol{\phi}} \ln q(\mathbf{z}|\boldsymbol{\phi}) \,\mathrm{d}\mathbf{z}. \tag{19.20}$$

Hence, show that the left-hand side of (19.20) can be approximated by the following
Monte Carlo estimator:

$$\nabla_{\boldsymbol{\phi}} \int q(\mathbf{z}|\boldsymbol{\phi})\, G(\mathbf{z}) \,\mathrm{d}\mathbf{z} \simeq \frac{1}{L} \sum_{l=1}^{L} G(\mathbf{z}^{(l)})\, \nabla_{\boldsymbol{\phi}} \ln q(\mathbf{z}^{(l)}|\boldsymbol{\phi}) \tag{19.21}$$

where the samples $\{\mathbf{z}^{(l)}\}$ are drawn independently from the distribution $q(\mathbf{z}|\boldsymbol{\phi})$.
Verify that this estimator is unbiased, i.e., that the average value of the right-hand side
of (19.21), averaged over the distribution of the samples, is equal to the left-hand
side. In principle, by setting $G(\mathbf{z}) = p(\mathbf{x}|\mathbf{z}, \mathbf{w})$, this result would allow the gradient
of the second term on the right-hand side of (19.14) with respect to $\boldsymbol{\phi}$ to be
evaluated without making use of the reparameterization trick. Also, because this method
is unbiased, it will give the exact answer in the limit of an infinite number of
samples. However, the reparameterization trick is more efficient, meaning that fewer
samples are needed to get good accuracy, because it directly computes the change of
$p(\mathbf{x}|\mathbf{z}, \mathbf{w})$ due to the change in $\mathbf{z}$ that results from a change in $\boldsymbol{\phi}$.

19.2 (⋆) Verify that if $\epsilon$ has a zero-mean unit-variance Gaussian distribution, then the
variable $z$ in (19.17) will have a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

19.3 (⋆⋆) In this exercise we extend the diagonal covariance VAE encoder network (19.13)
to one with a general covariance matrix. Consider a K-dimensional random vector
drawn from a simple Gaussian:

$$\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0}, \mathbf{I}), \tag{19.22}$$

which is then linearly transformed using the relation

$$\mathbf{z} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\epsilon} \tag{19.23}$$

where $\mathbf{L}$ is a lower-triangular matrix (i.e., a K × K matrix with all elements above
the leading diagonal being zero). Show that $\mathbf{z}$ has a distribution $\mathcal{N}(\mathbf{z}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and
write down an expression for $\boldsymbol{\Sigma}$ in terms of $\mathbf{L}$. Explain why the diagonal elements of
$\mathbf{L}$ must be non-negative. Describe how $\boldsymbol{\mu}$ and $\mathbf{L}$ can be expressed as the outputs of a
neural network, and discuss suitable choices for output-unit activation functions.

19.4 (⋆⋆) Evaluate the Kullback–Leibler divergence term in (19.14). Hence, show how
the gradients of this term with respect to $\mathbf{w}$ and $\boldsymbol{\phi}$ can be evaluated for training the
encoder and decoder networks.

19.5 (⋆) We have seen that the ELBO given by (19.11) can be written in the form (19.14).
Show that it can also be written as

$$\mathcal{L}_n(\mathbf{w}, \boldsymbol{\phi}) = \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln \{p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w})\, p(\mathbf{z}_n)\} \,\mathrm{d}\mathbf{z}_n - \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \,\mathrm{d}\mathbf{z}_n. \tag{19.24}$$

19.6 (⋆) Show that the ELBO given by (19.11) can be written in the form

$$\mathcal{L}_n(\mathbf{w}, \boldsymbol{\phi}) = \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln p(\mathbf{z}_n) \,\mathrm{d}\mathbf{z}_n + \int q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi}) \ln \left\{ \frac{p(\mathbf{x}_n|\mathbf{z}_n, \mathbf{w})}{q(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\phi})} \right\} \mathrm{d}\mathbf{z}_n. \tag{19.25}$$


Diffusion Models


We have seen that a powerful way to construct rich generative models is to introduce 
a distribution p(z) over a latent variable z, and then to transform z into the data space 
x using a deep neural network. It is sufficient to use a simple, fixed distribution 
for p(z), such as a Gaussian $\mathcal{N}(\mathbf{z}|\mathbf{0}, \mathbf{I})$, since the generality of the neural network
transforms this into a highly flexible family of distributions over x. In previous
chapters we have explored several models which fit within this framework but which
take different approaches to defining and training the deep neural network, based on
generative adversarial networks, variational autoencoders, and normalizing flows.

In this chapter we discuss a fourth class of models within this general framework,
known as diffusion models, also called denoising diffusion probabilistic models,
or DDPMs (Sohl-Dickstein et al., 2015; Ho, Jain, and Abbeel, 2020), which
have emerged as the state of the art for many applications. For illustration we will
focus on models of image data although the framework has much broader applicability.
The central idea is to take each training image and to corrupt it using a multi-step
noise process to transform it into a sample from a Gaussian distribution. This is
illustrated in Figure 20.1. A deep neural network is then trained to invert this process,
and once trained the network can then generate new images starting with samples
from a Gaussian as input.

Figure 20.1 Illustration of the encoding process in a diffusion model showing an image x that is gradually
corrupted with multiple stages of additive Gaussian noise giving a sequence of increasingly noisy images. After
a large number T of steps the result is indistinguishable from a sample drawn from a Gaussian distribution. A
deep neural network is then trained to reverse this process.

Diffusion models can be viewed as a form of hierarchical variational autoen- 
coder in which the encoder distribution is fixed, and defined by the noise process, 
and only the generative distribution is learned (Luo, 2022). They are easy to train, 
they scale well on parallel hardware, and they avoid the challenges and instabilities 
of adversarial training while producing results that have quality comparable to, or 
better than, generative adversarial networks. However, generating new samples can 
be computationally expensive due to the need for multiple forward passes through 
the decoder network (Dhariwal and Nichol, 2021). 


20.1 Forward Encoder


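Suppose we take an image from the training set, which we will denote by x, and
blend it with Gaussian noise independently for each pixel to give a noise-corrupted
image $\mathbf{z}_1$ defined by

$$\mathbf{z}_1 = \sqrt{1 - \beta_1}\,\mathbf{x} + \sqrt{\beta_1}\,\boldsymbol{\epsilon}_1 \tag{20.1}$$

where $\boldsymbol{\epsilon}_1 \sim \mathcal{N}(\boldsymbol{\epsilon}_1|\mathbf{0}, \mathbf{I})$ and $\beta_1 < 1$ is the variance of the noise distribution. The
choice of coefficients $\sqrt{1 - \beta_t}$ and $\sqrt{\beta_t}$ in (20.1) and (20.3) ensures that the mean of
the distribution of $\mathbf{z}_t$ is closer to zero than the mean of $\mathbf{z}_{t-1}$ and that the variance of $\mathbf{z}_t$
is closer to the unit matrix than the variance of $\mathbf{z}_{t-1}$. We can write the transformation
(20.1) in the form

$$q(\mathbf{z}_1|\mathbf{x}) = \mathcal{N}(\mathbf{z}_1|\sqrt{1 - \beta_1}\,\mathbf{x}, \beta_1 \mathbf{I}). \tag{20.2}$$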


We then repeat the process with additional independent Gaussian noise steps to give
a sequence of increasingly noisy images $\mathbf{z}_2, \ldots, \mathbf{z}_T$. Note that in the literature on
diffusion models, these latent variables are sometimes denoted $\mathbf{x}_1, \ldots, \mathbf{x}_T$ and the
observed variable is denoted $\mathbf{x}_0$. We use the notation of z for latent variables and x
for the observed variable for consistency with the rest of the book. Each successive
image is given by

$$\mathbf{z}_t = \sqrt{1 - \beta_t}\,\mathbf{z}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}_t \tag{20.3}$$

where $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\boldsymbol{\epsilon}_t|\mathbf{0}, \mathbf{I})$. Again, we can write (20.3) in the form

$$q(\mathbf{z}_t|\mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_t|\sqrt{1 - \beta_t}\,\mathbf{z}_{t-1}, \beta_t \mathbf{I}). \tag{20.4}$$

The sequence of conditional distributions (20.4) forms a Markov chain and can be
expressed as a probabilistic graphical model as shown in Figure 20.2. The values of
the variance parameters $\beta_t \in (0, 1)$ are set by hand and are typically chosen such
that the variance values increase through the chain according to a prescribed schedule
such that $\beta_1 < \beta_2 < \ldots < \beta_T$.

Figure 20.2 A diffusion process represented as a probabilistic graphical model. The original image x is
shown by the shaded node, since it is an observed variable, whereas the noise-corrupted
images $\mathbf{z}_1, \ldots, \mathbf{z}_T$ are considered to be latent variables. The noise process is defined by
the forward distribution $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ and can be viewed as an encoder. Our goal is to learn a
model $p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})$ that tries to reverse this noise process and which can be viewed as a
decoder. As we will see later, the conditional distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})$ plays an important
role in defining the training procedure.


20.1.1 Diffusion kernel 


The joint distribution of the latent variables, conditioned on the observed data
vector x, is given by

$$q(\mathbf{z}_1, \ldots, \mathbf{z}_t|\mathbf{x}) = q(\mathbf{z}_1|\mathbf{x}) \prod_{\tau=2}^{t} q(\mathbf{z}_\tau|\mathbf{z}_{\tau-1}). \tag{20.5}$$

If we now marginalize over the intermediate variables $\mathbf{z}_1, \ldots, \mathbf{z}_{t-1}$, we obtain the
diffusion kernel:

$$q(\mathbf{z}_t|\mathbf{x}) = \mathcal{N}(\mathbf{z}_t|\sqrt{\alpha_t}\,\mathbf{x}, (1 - \alpha_t)\mathbf{I}) \tag{20.6}$$

where we have defined

$$\alpha_t = \prod_{\tau=1}^{t} (1 - \beta_\tau). \tag{20.7}$$

We see that each intermediate distribution has a simple closed-form Gaussian
expression from which we can directly sample, which will prove useful when training
DDPMs as it allows efficient stochastic gradient descent using randomly chosen
intermediate terms in the Markov chain without having to run the whole chain. We can
also write (20.6) in the form

$$\mathbf{z}_t = \sqrt{\alpha_t}\,\mathbf{x} + \sqrt{1 - \alpha_t}\,\boldsymbol{\epsilon}_t \tag{20.8}$$

where again $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\boldsymbol{\epsilon}_t|\mathbf{0}, \mathbf{I})$. Note that $\boldsymbol{\epsilon}_t$ now represents the total noise added to
the original image instead of the incremental noise added at this step of the Markov
chain.
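
The direct sampling route described above can be sketched as follows, using only the Python standard library. The linear noise schedule and its end points are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

def alphas_from_betas(betas):
    """alpha_t = prod_{tau <= t} (1 - beta_tau), as in (20.7); index 0 holds alpha_1."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas

def sample_zt(x, t, alphas, rng):
    """Draw z_t directly from the diffusion kernel using (20.8); t is 1-based."""
    alpha_t = alphas[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    z_t = [math.sqrt(alpha_t) * xi + math.sqrt(1.0 - alpha_t) * ei
           for xi, ei in zip(x, eps)]
    return z_t, eps

T = 1000
# Assumed linear schedule with increasing variances beta_1 < ... < beta_T.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = alphas_from_betas(betas)

x = [0.5, -1.0, 2.0]  # a toy "image" with three pixels
z, eps = sample_zt(x, t=300, alphas=alphas, rng=random.Random(0))
```

No intermediate latent variables are generated here: a single noise draw is enough to obtain a sample at any chosen step.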

After many steps the image becomes indistinguishable from Gaussian noise, and
in the limit $T \to \infty$ we have

$$q(\mathbf{z}_T|\mathbf{x}) = \mathcal{N}(\mathbf{z}_T|\mathbf{0}, \mathbf{I}) \tag{20.9}$$

and therefore all information about the original image is lost. The choice of
coefficients $\sqrt{1 - \beta_t}$ and $\sqrt{\beta_t}$ in (20.3) ensures that once the Markov chain converges
to a distribution with zero mean and unit covariance, further updates will leave this
unchanged.
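
This behaviour can be checked with a small sketch that tracks the mean and variance of a single scalar component through repeated application of (20.3). The constant schedule and the starting value are assumptions chosen purely for illustration.

```python
def propagate_moments(mean, var, betas):
    """Propagate the mean and variance of one scalar component through (20.3)."""
    for beta in betas:
        mean = (1.0 - beta) ** 0.5 * mean   # the mean shrinks towards zero
        var = (1.0 - beta) * var + beta     # the variance moves towards one
    return mean, var

betas = [0.02] * 500  # assumed constant schedule

# Start from a fixed pixel value x = 3, i.e. mean 3 and variance 0.
mean_T, var_T = propagate_moments(3.0, 0.0, betas)

# Starting from zero mean and unit variance, one more step changes nothing.
mean_fixed, var_fixed = propagate_moments(0.0, 1.0, betas[:1])

print(mean_T, var_T, mean_fixed, var_fixed)
```

The final variance agrees with $1 - \alpha_T$ from (20.6), since the recursion for the variance is solved by $1 - \prod_\tau (1 - \beta_\tau)$ when the initial variance is zero.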

Since the right-hand side of (20.9) is independent of x, it follows that the marginal
distribution of $\mathbf{z}_T$ is given by

$$q(\mathbf{z}_T) = \mathcal{N}(\mathbf{z}_T|\mathbf{0}, \mathbf{I}). \tag{20.10}$$

It is common to refer to the Markov chain (20.4) as the forward process, and it is
analogous to the encoder in a VAE, except that here it is fixed rather than learned.
Note, however, that this terminology is the opposite of that typically used in the
literature on normalizing flows, where the mapping from latent space to data space
is considered the forward process.


20.1.2 Conditional distribution 


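Our goal is to learn to undo the noise process, and so it is natural to consider
the reverse of the conditional distribution $q(\mathbf{z}_t|\mathbf{z}_{t-1})$, which we can express using
Bayes' theorem in the form

$$q(\mathbf{z}_{t-1}|\mathbf{z}_t) = \frac{q(\mathbf{z}_t|\mathbf{z}_{t-1})\, q(\mathbf{z}_{t-1})}{q(\mathbf{z}_t)}. \tag{20.11}$$

We can write the marginal distribution $q(\mathbf{z}_{t-1})$ in the form

$$q(\mathbf{z}_{t-1}) = \int q(\mathbf{z}_{t-1}|\mathbf{x})\, p(\mathbf{x}) \,\mathrm{d}\mathbf{x} \tag{20.12}$$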


where $q(\mathbf{z}_{t-1}|\mathbf{x})$ is given by the conditional Gaussian (20.6). This distribution is
intractable, however, because we must integrate over the unknown data density p(x).
If we approximate the integration using samples from the training data set, we obtain
a complicated distribution expressed as a mixture of Gaussians.

Instead, we consider the conditional version of the reverse distribution,
conditioned on the data vector x, defined by $q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})$, which as we will see shortly
turns out to be a simple Gaussian distribution. Intuitively this is reasonable since,
given a noisy image, it is difficult to guess which lower-noise image gave rise to it,
whereas if we also know the starting image then the problem becomes much easier.
We can calculate this conditional distribution using Bayes' theorem:

$$q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x}) = \frac{q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x})\, q(\mathbf{z}_{t-1}|\mathbf{x})}{q(\mathbf{z}_t|\mathbf{x})}. \tag{20.13}$$

We now make use of the Markov property of the forward process to write

$$q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x}) = q(\mathbf{z}_t|\mathbf{z}_{t-1}) \tag{20.14}$$

where the right-hand side is given by (20.4). As a function of $\mathbf{z}_{t-1}$, this takes the
form of an exponential of a quadratic form. The term $q(\mathbf{z}_{t-1}|\mathbf{x})$ in the numerator of
(20.13) is the diffusion kernel given by (20.6), which again involves the exponential
of a quadratic form with respect to $\mathbf{z}_{t-1}$. We can ignore the denominator in (20.13)
since as a function of $\mathbf{z}_{t-1}$ it is constant. Thus, we see that the right-hand side of
(20.13) takes the form of a Gaussian distribution, and we can identify its mean and
covariance using the technique of 'completing the square' to give

$$q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x}) = \mathcal{N}(\mathbf{z}_{t-1}|\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t), \sigma_t^2 \mathbf{I}) \tag{20.15}$$

where

$$\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t) = \frac{(1 - \alpha_{t-1})\sqrt{1 - \beta_t}\,\mathbf{z}_t + \sqrt{\alpha_{t-1}}\,\beta_t\,\mathbf{x}}{1 - \alpha_t} \tag{20.16}$$

$$\sigma_t^2 = \frac{\beta_t (1 - \alpha_{t-1})}{1 - \alpha_t} \tag{20.17}$$

and we have made use of (20.7).
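
As a sanity check on (20.16) and (20.17), the sketch below compares them, for scalar variables, with the result of combining the two Gaussian factors in (20.13) by adding their precisions. The function names and numerical values are hypothetical.

```python
import math

def posterior_params(x, z_t, beta_t, alpha_prev, alpha_t):
    """Mean (20.16) and variance (20.17) of q(z_{t-1} | z_t, x) for scalars."""
    mean = ((1.0 - alpha_prev) * math.sqrt(1.0 - beta_t) * z_t
            + math.sqrt(alpha_prev) * beta_t * x) / (1.0 - alpha_t)
    var = beta_t * (1.0 - alpha_prev) / (1.0 - alpha_t)
    return mean, var

def posterior_by_precisions(x, z_t, beta_t, alpha_prev):
    """Same quantities from completing the square over z_{t-1}."""
    prec_kernel = 1.0 / (1.0 - alpha_prev)    # from q(z_{t-1} | x), see (20.6)
    prec_step = (1.0 - beta_t) / beta_t       # from q(z_t | z_{t-1}), see (20.4)
    var = 1.0 / (prec_kernel + prec_step)
    mean = var * (prec_kernel * math.sqrt(alpha_prev) * x
                  + math.sqrt(1.0 - beta_t) * z_t / beta_t)
    return mean, var

beta_t, alpha_prev = 0.1, 0.6             # illustrative values
alpha_t = alpha_prev * (1.0 - beta_t)     # consistent with (20.7)
m1, v1 = posterior_params(1.2, 0.7, beta_t, alpha_prev, alpha_t)
m2, v2 = posterior_by_precisions(1.2, 0.7, beta_t, alpha_prev)
print(m1, v1, m2, v2)
```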


20.2 Reverse Decoder


We have seen that the forward encoder model is defined by a sequence of Gaussian
conditional distributions $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ but that inverting this directly leads to a
distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ that is intractable, as it would require integrating over all possible
values of the starting vector x whose distribution is the unknown data distribution
p(x) that we wish to model. Instead, we will learn an approximation to the reverse
distribution by using a distribution $p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})$ governed by a deep neural network,
where w represents the network weights and biases. This reverse step is analogous
to the decoder in a variational autoencoder and is illustrated in Figure 20.2. Once
the network is trained, we can sample from the simple Gaussian distribution over $\mathbf{z}_T$
and transform it into a sample from the data distribution p(x) through a sequence of
reverse sampling steps by repeated application of the trained network.

Figure 20.3 Illustration of the evaluation of the reverse distribution $q(z_{t-1}|z_t)$ using Bayes' theorem (20.13) for
scalar variables. The red curve on the right-hand plot shows the marginal distribution $q(z_{t-1})$ illustrated using a
mixture of three Gaussians, whereas the left-hand plot shows the Gaussian forward noise process $q(z_t|z_{t-1})$ as
a distribution over $z_t$ centred on $z_{t-1}$. By multiplying these together and normalizing, we obtain the distribution
$q(z_{t-1}|z_t)$ shown for a particular choice of $z_t$ by the blue curve. Because the distribution on the left is relatively
broad, corresponding to a large variance $\beta_t$, the distribution $q(z_{t-1}|z_t)$ has a complex multimodal structure.

Intuitively, if we keep the variances small so that $\beta_t \ll 1$ then the change in the
latent vector between steps will be relatively small, and hence it should be easier to
learn to invert the transformation. More specifically, if $\beta_t \ll 1$ then the distribution
$q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ will be approximately a Gaussian distribution over $\mathbf{z}_{t-1}$. This can be
seen from (20.11) since the right-hand side depends on $\mathbf{z}_{t-1}$ through $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ and
$q(\mathbf{z}_{t-1})$. If $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ is a sufficiently narrow Gaussian then $q(\mathbf{z}_{t-1})$ will vary only
a small amount over the region in which $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ has significant mass, and hence
$q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ will also be approximately Gaussian. This intuition can be confirmed
using a simple example as shown in Figures 20.3 and 20.4. However, since the
variances at each step are small, we must use a large number of steps to ensure that
the distribution over the final latent variable $\mathbf{z}_T$ obtained from the forward noising
process will still be close to a Gaussian, and this increases the cost of generating new
samples. In practice, T may be several thousand.
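
The same intuition can be explored numerically for scalar variables. The sketch below evaluates the unnormalized right-hand side of (20.11) on a grid and counts its local maxima for a large and a small value of $\beta_t$. The three-component mixture standing in for $q(z_{t-1})$, the grid, and all numerical values are assumptions chosen for illustration.

```python
import math

def gauss(v, mean, var):
    """Density of a scalar Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal(z_prev):
    """Assumed toy marginal q(z_{t-1}): equal-weight mixture of three Gaussians."""
    return sum(gauss(z_prev, m, 0.09) for m in (-2.0, 0.0, 2.0)) / 3.0

def count_modes(z_t, beta_t, grid):
    """Count strict local maxima of q(z_t | z_{t-1}) q(z_{t-1}) over z_{t-1}."""
    p = [gauss(z_t, math.sqrt(1.0 - beta_t) * z, beta_t) * marginal(z) for z in grid]
    return sum(1 for i in range(1, len(p) - 1) if p[i - 1] < p[i] > p[i + 1])

grid = [-4.0 + 0.01 * i for i in range(801)]
modes_broad = count_modes(z_t=0.5, beta_t=0.9, grid=grid)     # large noise variance
modes_narrow = count_modes(z_t=0.5, beta_t=0.001, grid=grid)  # small noise variance
print(modes_broad, modes_narrow)
```

With the broad forward distribution the product inherits the modes of the mixture, whereas with the narrow one only a single peak close to $z_t$ survives.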

We can see more formally that $q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ will be approximately Gaussian by
making a Taylor series expansion of $\ln q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ around the point $\mathbf{z}_t$ as a function
of $\mathbf{z}_{t-1}$. This also shows that for small variance, the reverse distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t)$
will have a covariance that is close to the covariance $\beta_t \mathbf{I}$ of the forward noise process
$q(\mathbf{z}_t|\mathbf{z}_{t-1})$. We therefore model the reverse process using a Gaussian distribution of
the form

$$p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w}) = \mathcal{N}(\mathbf{z}_{t-1}|\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t), \beta_t \mathbf{I}) \tag{20.18}$$

where $\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t)$ is a deep neural network governed by a set of parameters w. Note
that the network takes the step index t explicitly as an input so that it can account
for the variation of the variance $\beta_t$ across different steps of the chain. This allows us
to use a single network to invert all the steps in the Markov chain, instead of having
to learn a separate network for each step. It is also possible to learn the covariances
of the denoising process by incorporating further outputs in the network to account
for the curvature in the distribution $q(\mathbf{z}_{t-1})$ in the neighbourhood of $\mathbf{z}_t$ (Nichol and
Dhariwal, 2021). There is considerable flexibility in the choice of architecture for the
neural network used to model $\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t)$ provided the output has the same
dimensionality as the input. Given this restriction, a U-net architecture is a common choice
for image processing applications.

Figure 20.4 As in Figure 20.3 but in which the Gaussian distribution $q(z_t|z_{t-1})$ in the left-hand plot has a much
smaller variance $\beta_t$. We see that the corresponding distribution $q(z_{t-1}|z_t)$ shown in blue on the right-hand plot
is close to being Gaussian, with a similar variance to $q(z_t|z_{t-1})$.

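The overall reverse denoising process then takes the form of a Markov chain
given by

$$p(\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{w}) = p(\mathbf{z}_T) \left\{ \prod_{t=2}^{T} p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w}) \right\} p(\mathbf{x}|\mathbf{z}_1, \mathbf{w}). \tag{20.19}$$

Here $p(\mathbf{z}_T)$ is assumed to be the same as the distribution of $q(\mathbf{z}_T)$ and hence is
given by $\mathcal{N}(\mathbf{z}_T|\mathbf{0}, \mathbf{I})$. Once the model has been trained, sampling is straightforward
because we first sample from the simple Gaussian $p(\mathbf{z}_T)$ and then we sample
sequentially from each of the conditional distributions $p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})$ in turn, finally
sampling from $p(\mathbf{x}|\mathbf{z}_1, \mathbf{w})$ to obtain a sample x in the data space.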


20.2.1 Training the decoder 


We next have to decide on an objective function for training the neural network.
The obvious choice is the likelihood function, which for data point x is given by

$$p(\mathbf{x}|\mathbf{w}) = \int \cdots \int p(\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{w}) \,\mathrm{d}\mathbf{z}_1 \ldots \mathrm{d}\mathbf{z}_T \tag{20.20}$$

in which $p(\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{w})$ is defined by (20.19). This is an instance of the general
latent-variable model (16.81) in which the latent variables comprise $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_T)$
and the observed variable is x. Note that the latent variables all have the same
dimensionality as the data space, as was the case for normalizing flows but not for
variational autoencoders or generative adversarial networks. We see from (20.20) that the
likelihood involves integrating over all possible trajectories by which noise samples
could give rise to the observed data point. The integrals in (20.20) are intractable as
they involve integrating over the highly complex neural network functions.


20.2.2 Evidence lower bound


Since the exact likelihood is intractable, we can adopt a similar approach to that 
used with variational autoencoders and maximize a lower bound on the log likelihood 
called the evidence lower bound (ELBO), which we re-derive here in the context of 
diffusion models. For any choice of distribution q(z), the following relation always 
holds: 

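$$\ln p(\mathbf{x}|\mathbf{w}) = \mathcal{L}(\mathbf{w}) + \mathrm{KL}\left(q(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x}, \mathbf{w})\right) \tag{20.21}$$

where $\mathcal{L}$ is the evidence lower bound, also known as the variational lower bound,
given by

$$\mathcal{L}(\mathbf{w}) = \int q(\mathbf{z}) \ln \left\{ \frac{p(\mathbf{x}, \mathbf{z}|\mathbf{w})}{q(\mathbf{z})} \right\} \mathrm{d}\mathbf{z} \tag{20.22}$$

and the Kullback–Leibler divergence $\mathrm{KL}(f \| g)$ between two probability densities
$f(\mathbf{z})$ and $g(\mathbf{z})$ is defined by

$$\mathrm{KL}\left(f(\mathbf{z}) \,\|\, g(\mathbf{z})\right) = -\int f(\mathbf{z}) \ln \left\{ \frac{g(\mathbf{z})}{f(\mathbf{z})} \right\} \mathrm{d}\mathbf{z}. \tag{20.23}$$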


To verify the relation (20.21) first note that, from the product rule of probability, we
have

$$p(\mathbf{x}, \mathbf{z}|\mathbf{w}) = p(\mathbf{z}|\mathbf{x}, \mathbf{w})\, p(\mathbf{x}|\mathbf{w}). \tag{20.24}$$

Substituting (20.24) into (20.22) and making use of (20.23) gives (20.21). The
Kullback–Leibler divergence has the property $\mathrm{KL}(\cdot \| \cdot) \geqslant 0$ from which it follows
that

$$\ln p(\mathbf{x}|\mathbf{w}) \geqslant \mathcal{L}(\mathbf{w}). \tag{20.25}$$

Since the log likelihood function is intractable, we train the neural network by
maximizing the lower bound $\mathcal{L}(\mathbf{w})$.
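
The decomposition (20.21) can be verified numerically on a toy model in which the latent variable is discrete, so that the integrals become sums. The probabilities below are hypothetical values chosen for illustration.

```python
import math

# Assumed toy model with a binary latent variable z and one fixed observation x.
p_z = [0.3, 0.7]             # prior p(z)
p_x_given_z = [0.9, 0.2]     # likelihood p(x|z) evaluated at the observed x
q_z = [0.5, 0.5]             # an arbitrary valid choice of q(z)

p_joint = [pz * px for pz, px in zip(p_z, p_x_given_z)]   # p(x, z)
p_x = sum(p_joint)                                         # p(x)
posterior = [pj / p_x for pj in p_joint]                   # p(z|x)

# Discrete analogues of (20.22) and (20.23).
elbo = sum(q * math.log(pj / q) for q, pj in zip(q_z, p_joint))
kl = -sum(q * math.log(post / q) for q, post in zip(q_z, posterior))

print(math.log(p_x), elbo + kl)
```

Replacing `q_z` by `posterior` makes the divergence vanish, so that the bound becomes tight.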

To do this, we first derive an explicit form for the lower bound of the diffusion
model. In defining the lower bound we are free to choose any form we like for $q(\mathbf{z})$ as
long as it is a valid probability distribution, i.e., that it is non-negative and integrates
to 1. With many applications of the ELBO, such as the variational autoencoder, we
choose a form for $q(\mathbf{z})$ that has adjustable parameters, often in the form of a deep
neural network, and then we maximize the ELBO with respect to those parameters
as well as with respect to the parameters of the distribution $p(\mathbf{x}, \mathbf{z}|\mathbf{w})$. Optimizing
the distribution $q(\mathbf{z})$ encourages the bound to be tight, which brings the optimization
of the parameters in $p(\mathbf{x}, \mathbf{z}|\mathbf{w})$ closer to that of maximum likelihood. With diffusion
models, however, we choose $q(\mathbf{z})$ to be given by the fixed distribution $q(\mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{x})$
defined by the Markov chain (20.5), and so the only adjustable parameters are those
in the model $p(\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{w})$ for the reverse Markov chain. Note that we are
using the flexibility in the choice of $q(\mathbf{z})$ to select a form that depends on x.

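We therefore substitute for $q(\mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{x})$ in (20.21) using (20.5), and likewise
we substitute for $p(\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{w})$ using (20.19), which allows us to write the
ELBO in the form

$$\begin{aligned}
\mathcal{L}(\mathbf{w}) &= \mathbb{E}_q \left[ \ln \frac{p(\mathbf{z}_T) \left\{ \prod_{t=2}^{T} p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w}) \right\} p(\mathbf{x}|\mathbf{z}_1, \mathbf{w})}{q(\mathbf{z}_1|\mathbf{x}) \prod_{t=2}^{T} q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x})} \right] \\
&= \mathbb{E}_q \left[ \ln p(\mathbf{z}_T) + \sum_{t=2}^{T} \ln \frac{p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})}{q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x})} - \ln q(\mathbf{z}_1|\mathbf{x}) + \ln p(\mathbf{x}|\mathbf{z}_1, \mathbf{w}) \right]
\end{aligned} \tag{20.26}$$

where we have defined

$$\mathbb{E}_q[\,\cdot\,] \equiv \int \cdots \int q(\mathbf{z}_1|\mathbf{x}) \left\{ \prod_{t=2}^{T} q(\mathbf{z}_t|\mathbf{z}_{t-1}) \right\} [\,\cdot\,] \,\mathrm{d}\mathbf{z}_1 \ldots \mathrm{d}\mathbf{z}_T. \tag{20.27}$$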

The first term $\ln p(\mathbf{z}_T)$ on the right-hand side of (20.26) is just the fixed
distribution $\mathcal{N}(\mathbf{z}_T|\mathbf{0}, \mathbf{I})$. This has no trainable parameters and can therefore be omitted
from the ELBO since it represents a fixed additive constant. Similarly, the third term
$-\ln q(\mathbf{z}_1|\mathbf{x})$ is independent of w and so again can be omitted.

The fourth term on the right-hand side of (20.26) corresponds to the reconstruction
term from the variational autoencoder. It can be evaluated by approximating the
expectation $\mathbb{E}_q[\,\cdot\,]$ by a Monte Carlo estimate obtained by drawing samples from the
distribution over $\mathbf{z}_1$ defined by (20.2) so that

$$\mathbb{E}_q\left[\ln p(\mathbf{x}|\mathbf{z}_1, \mathbf{w})\right] \simeq \frac{1}{L} \sum_{l=1}^{L} \ln p(\mathbf{x}|\mathbf{z}_1^{(l)}, \mathbf{w}) \tag{20.28}$$

where $\mathbf{z}_1^{(l)} \sim \mathcal{N}(\mathbf{z}_1|\sqrt{1 - \beta_1}\,\mathbf{x}, \beta_1 \mathbf{I})$. Unlike with VAEs we do not need to
backpropagate an error signal through the sampled value because the q-distribution is
fixed and so there is no need here for the reparameterization trick.

This leaves the second term on the right-hand side of (20.26), which comprises a
sum of terms each of which is dependent on a pair of adjacent latent-variable values
$\mathbf{z}_{t-1}$ and $\mathbf{z}_t$. We saw earlier when we derived the diffusion kernel (20.6) that we can
sample from $q(\mathbf{z}_{t-1}|\mathbf{x})$ directly as a Gaussian distribution and we could then obtain
a corresponding sample of $\mathbf{z}_t$ using (20.4), which is also a Gaussian. Although this
would be a correct procedure in the limit of an infinite number of samples, the use of
pairs of sampled values creates very noisy estimates with high variance, so that an
unnecessarily large number of samples is required. Instead, we rewrite the ELBO
in a form that can be estimated by sampling just one value per term.


20.2.3 Rewriting the ELBO 


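Following our discussion of the ELBO for the variational autoencoder, our goal
here is to write the ELBO in terms of Kullback–Leibler divergences, which we can
then express in closed form. The neural network is a model of the
distribution in the reverse direction $p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})$ whereas the q-distribution is
expressed in the forward direction $q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x})$, and so we use Bayes' theorem to
reverse the conditional distribution by writing

$$q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x}) = \frac{q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})\, q(\mathbf{z}_t|\mathbf{x})}{q(\mathbf{z}_{t-1}|\mathbf{x})}. \tag{20.29}$$

This allows us to write the second term in (20.26) in the form

$$\ln \frac{p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})}{q(\mathbf{z}_t|\mathbf{z}_{t-1}, \mathbf{x})} = \ln \frac{p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})}{q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})} + \ln \frac{q(\mathbf{z}_{t-1}|\mathbf{x})}{q(\mathbf{z}_t|\mathbf{x})}. \tag{20.30}$$

The second term on the right-hand side of (20.30) is independent of w and so can be
omitted. Substituting (20.30) into (20.26), we then obtain

$$\mathcal{L}(\mathbf{w}) = \mathbb{E}_q \left[ \sum_{t=2}^{T} \ln \frac{p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})}{q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})} + \ln p(\mathbf{x}|\mathbf{z}_1, \mathbf{w}) \right]. \tag{20.31}$$

Finally, we can rewrite (20.31) in the form

$$\mathcal{L}(\mathbf{w}) = \underbrace{\int q(\mathbf{z}_1|\mathbf{x}) \ln p(\mathbf{x}|\mathbf{z}_1, \mathbf{w}) \,\mathrm{d}\mathbf{z}_1}_{\text{reconstruction term}} - \underbrace{\sum_{t=2}^{T} \int \mathrm{KL}\left(q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x}) \,\|\, p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})\right) q(\mathbf{z}_t|\mathbf{x}) \,\mathrm{d}\mathbf{z}_t}_{\text{consistency terms}} \tag{20.32}$$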


where we have simplified the expectation over $q(\mathbf{z}_1, \ldots, \mathbf{z}_T|\mathbf{x})$ in the first term since
$\mathbf{z}_1$ is the only latent variable appearing in the integrand. Therefore in the expectation
defined by (20.27), all the conditional distributions integrate to unity leaving only
the integral over $\mathbf{z}_1$. Likewise, in the second term, each integral involves only two
adjacent latent variables $\mathbf{z}_{t-1}$ and $\mathbf{z}_t$, and all remaining variables can be integrated
out.

The bound (20.32) is now very similar to the ELBO for the variational
autoencoder given by (19.14), except that there are now multiple encoder and decoder
stages. The reconstruction term rewards high probability for the observed data
sample and can be trained in the same way as the corresponding term in the VAE by using
the sampling approximation (20.28). The consistency terms in (20.32) are defined
between pairs of Gaussian distributions and therefore can be expressed in closed
form, as follows. The distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})$ is given by (20.15) whereas the
distribution $p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})$ is given by (20.18) and so the Kullback–Leibler divergence
becomes

$$\mathrm{KL}\left(q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x}) \,\|\, p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})\right) = \frac{1}{2\beta_t} \left\| \mathbf{m}_t(\mathbf{x}, \mathbf{z}_t) - \boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t) \right\|^2 + \text{const} \tag{20.33}$$

where $\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t)$ is defined by (20.16) and where any additive terms that are
independent of the network parameters w have been absorbed into the constant term,
which plays no role in training. Each of the consistency terms in (20.32) has one
remaining integral over $\mathbf{z}_t$, weighted by $q(\mathbf{z}_t|\mathbf{x})$. This can be approximated by
drawing a sample from $q(\mathbf{z}_t|\mathbf{x})$, which can be done efficiently using the diffusion kernel
(20.6).

We see that the KL divergence (20.33) takes the form of a simple squared-loss 
function. Since we adjust the network parameters to maximize the lower bound in 
(20.32), we will be minimizing this squared error because there is a minus sign in 
front of the Kullback—Leibler divergence terms in the ELBO. 
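
The reduction of the divergence to a squared error can be checked with a short sketch. It uses the general expression for the Kullback–Leibler divergence between two isotropic Gaussians and confirms that changing the network output alters the divergence only through the squared-error term weighted by $1/(2\beta_t)$. All numerical values are hypothetical stand-ins.

```python
import math

def kl_isotropic(m, var_q, mu, var_p):
    """KL( N(m, var_q I) || N(mu, var_p I) ) for D-dimensional isotropic Gaussians."""
    d = len(m)
    sq_dist = sum((a - b) ** 2 for a, b in zip(m, mu))
    return 0.5 * (d * var_q / var_p + sq_dist / var_p - d
                  + d * math.log(var_p / var_q))

beta_t, sigma2_t = 0.02, 0.015   # illustrative variances of p and q
m_t = [0.4, -1.1, 0.3]           # stands in for m_t(x, z_t)
mu_a = [0.5, -1.0, 0.0]          # two candidate network outputs mu(z_t, w, t)
mu_b = [0.1, -0.9, 0.6]

def squared_term(mu):
    """The w-dependent part of the divergence: ||m_t - mu||^2 / (2 beta_t)."""
    return sum((a - b) ** 2 for a, b in zip(m_t, mu)) / (2.0 * beta_t)

kl_gap = (kl_isotropic(m_t, sigma2_t, mu_a, beta_t)
          - kl_isotropic(m_t, sigma2_t, mu_b, beta_t))
sq_gap = squared_term(mu_a) - squared_term(mu_b)
print(kl_gap, sq_gap)
```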


20.2.4 Predicting the noise 


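One modification that leads to higher quality results is to change the role of the
neural network so that instead of predicting the denoised image at each step of the
Markov chain it predicts the total noise component that was added to the original
image to create the noisy image at that step (Ho, Jain, and Abbeel, 2020). To do this
we first take (20.8) and rearrange to give

$$\mathbf{x} = \frac{1}{\sqrt{\alpha_t}}\,\mathbf{z}_t - \frac{\sqrt{1 - \alpha_t}}{\sqrt{\alpha_t}}\,\boldsymbol{\epsilon}_t. \tag{20.34}$$

If we now substitute this into (20.16) we can rewrite the mean $\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t)$ of the
reverse conditional distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})$ in terms of the noisy latent variable $\mathbf{z}_t$
and the noise $\boldsymbol{\epsilon}_t$ to give

$$\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t) = \frac{1}{\sqrt{1 - \beta_t}} \left\{ \mathbf{z}_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}}\,\boldsymbol{\epsilon}_t \right\}. \tag{20.35}$$

Similarly, instead of a neural network $\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t)$ that predicts the denoised image,
we introduce a neural network $\mathbf{g}(\mathbf{z}_t, \mathbf{w}, t)$ that aims to predict the total noise that was
added to x to generate $\mathbf{z}_t$. Following the same steps that led to (20.35) shows that
these two network functions are related by

$$\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t) = \frac{1}{\sqrt{1 - \beta_t}} \left\{ \mathbf{z}_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}}\,\mathbf{g}(\mathbf{z}_t, \mathbf{w}, t) \right\}. \tag{20.36}$$

We can now substitute (20.35) and (20.36) into (20.33) to give

$$\begin{aligned}
\mathrm{KL}\left(q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x}) \,\|\, p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})\right)
&= \frac{\beta_t}{2(1 - \alpha_t)(1 - \beta_t)} \left\| \mathbf{g}(\mathbf{z}_t, \mathbf{w}, t) - \boldsymbol{\epsilon}_t \right\|^2 + \text{const} \\
&= \frac{\beta_t}{2(1 - \alpha_t)(1 - \beta_t)} \left\| \mathbf{g}(\sqrt{\alpha_t}\,\mathbf{x} + \sqrt{1 - \alpha_t}\,\boldsymbol{\epsilon}_t, \mathbf{w}, t) - \boldsymbol{\epsilon}_t \right\|^2 + \text{const}
\end{aligned} \tag{20.37}$$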


where in the final line we have substituted for $\mathbf{z}_t$ using (20.8).

The reconstruction term in the ELBO (20.32) can be approximated using (20.28)
with a sampled value of $\mathbf{z}_1$. Using the form (20.18) for $p(\mathbf{x}|\mathbf{z}_1, \mathbf{w})$ we have

$$\ln p(\mathbf{x}|\mathbf{z}_1, \mathbf{w}) = -\frac{1}{2\beta_1} \left\| \mathbf{x} - \boldsymbol{\mu}(\mathbf{z}_1, \mathbf{w}, 1) \right\|^2 + \text{const}. \tag{20.38}$$

If we substitute for $\boldsymbol{\mu}(\mathbf{z}_1, \mathbf{w}, 1)$ using (20.36) and we substitute for x using (20.1)
and then make use of $\alpha_1 = (1 - \beta_1)$, which follows from (20.7), we obtain

$$\ln p(\mathbf{x}|\mathbf{z}_1, \mathbf{w}) = -\frac{1}{2(1 - \beta_1)} \left\| \mathbf{g}(\mathbf{z}_1, \mathbf{w}, 1) - \boldsymbol{\epsilon}_1 \right\|^2 + \text{const}. \tag{20.39}$$

This is precisely the same form as (20.37) for the special case t = 1, and so the
reconstruction and consistency terms can be combined.
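
The algebra behind these results can be spot-checked numerically for scalar variables. The sketch below confirms that the posterior mean (20.16) coincides with its noise-based form (20.35) when $z_t$ is generated from (20.8), and that the coefficient in (20.37) reduces to that of (20.39) when t = 1. The function names and numerical values are hypothetical.

```python
import math

def mean_from_x(x, z_t, beta_t, alpha_prev, alpha_t):
    """Posterior mean m_t(x, z_t) in the form (20.16)."""
    return ((1.0 - alpha_prev) * math.sqrt(1.0 - beta_t) * z_t
            + math.sqrt(alpha_prev) * beta_t * x) / (1.0 - alpha_t)

def mean_from_noise(z_t, eps, beta_t, alpha_t):
    """The same mean written in terms of z_t and the total noise, as in (20.35)."""
    return (z_t - beta_t / math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(1.0 - beta_t)

beta_t, alpha_prev = 0.05, 0.7            # illustrative values
alpha_t = alpha_prev * (1.0 - beta_t)     # consistent with (20.7)
x, eps = 0.8, -0.3
z_t = math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * eps   # (20.8)

m_a = mean_from_x(x, z_t, beta_t, alpha_prev, alpha_t)
m_b = mean_from_noise(z_t, eps, beta_t, alpha_t)

# Coefficient of the squared error in (20.37) at t = 1, where alpha_1 = 1 - beta_1.
beta_1 = 0.05
alpha_1 = 1.0 - beta_1
coeff_t1 = beta_1 / (2.0 * (1.0 - alpha_1) * (1.0 - beta_1))
coeff_recon = 1.0 / (2.0 * (1.0 - beta_1))   # coefficient in (20.39)
print(m_a, m_b, coeff_t1, coeff_recon)
```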

Ho, Jain, and Abbeel (2020) found empirically that performance is further
improved simply by omitting the factor $\beta_t / \{2(1 - \alpha_t)(1 - \beta_t)\}$ in front of (20.37), so
that all steps in the Markov chain have equal weighting. Substituting this simplified
version of (20.37) into (20.32) gives a training objective function in the form

$$\mathcal{L}(\mathbf{w}) = -\sum_{t=1}^{T} \left\| \mathbf{g}(\sqrt{\alpha_t}\,\mathbf{x} + \sqrt{1 - \alpha_t}\,\boldsymbol{\epsilon}_t, \mathbf{w}, t) - \boldsymbol{\epsilon}_t \right\|^2. \tag{20.40}$$

The squared error on the right-hand side of (20.40) has a very simple interpretation:
for a given step t in the Markov chain and for a given training data point x, we sample
a noise vector $\boldsymbol{\epsilon}_t$ and use this to create the corresponding noisy latent vector $\mathbf{z}_t$ for
that step. The loss function is then the squared difference between the predicted
noise and the actual noise. Note that the network $\mathbf{g}(\cdot, \cdot, \cdot)$ is predicting the total noise
added to the original data vector x, not just the incremental noise added in step t.

When we use stochastic gradient descent, we evaluate the gradient vector of the 
loss function with respect to the network parameters for a randomly selected data 
point x from the training set. Also, for each such data point we randomly select a 
step t along the Markov chain, rather than evaluate the error for every term in the 
summation over t in (20.40). These gradients are accumulated over mini-batches of 
data samples and then used to update the weights. 
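
A single stochastic loss evaluation of this kind can be sketched as follows. The callable `g(z_t, t)` stands in for the trained network with its weights held inside it, and the linear schedule, the `zero_net` baseline, and the `oracle` that is given access to x are all hypothetical devices used only to exercise the code.

```python
import math
import random

def ddpm_loss_term(x, t, alphas, g, rng):
    """Squared error between predicted and actual noise at step t (1-based)."""
    alpha_t = alphas[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x]                    # fresh noise vector
    z_t = [math.sqrt(alpha_t) * xi + math.sqrt(1.0 - alpha_t) * ei
           for xi, ei in zip(x, eps)]                         # noisy latent, (20.8)
    pred = g(z_t, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps))

T = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alphas, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas.append(running)

rng = random.Random(0)
x = [0.5, -1.0]
t = rng.randint(1, T)            # randomly selected step along the chain

def zero_net(z_t, t):
    """Baseline that always predicts zero noise."""
    return [0.0 for _ in z_t]

def oracle(z_t, t):
    """Recovers the noise exactly by inverting (20.8) with knowledge of x."""
    a = alphas[t - 1]
    return [(zi - math.sqrt(a) * xi) / math.sqrt(1.0 - a) for zi, xi in zip(z_t, x)]

loss_zero = ddpm_loss_term(x, t, alphas, zero_net, rng)
loss_oracle = ddpm_loss_term(x, t, alphas, oracle, rng)
print(t, loss_zero, loss_oracle)
```

In a real training loop the gradient of this quantity with respect to the network weights would be accumulated over a mini-batch before the optimizer step.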

Also note that this loss function automatically builds in a form of data augmen- 
tation, because every time a particular training sample x is used it is combined with a 
fresh sample €+ of noise. All the above relates to a single data point x from the train- 
ing set. The corresponding computation of the gradient is shown in Algorithm 20.1. 


20.2.5 Generating new samples 


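Algorithm 20.1: Training a denoising diffusion probabilistic model

Input: Training data $\mathcal{D} = \{\mathbf{x}_n\}$
       Noise schedule $\{\beta_1, \ldots, \beta_T\}$
Output: Network parameters $\mathbf{w}$

for $t \in \{1, \ldots, T\}$ do
    $\alpha_t \leftarrow \prod_{\tau=1}^{t} (1 - \beta_\tau)$ // Calculate alphas from betas
end for
repeat
    $\mathbf{x} \sim \mathcal{D}$ // Sample a data point
    $t \sim \{1, \ldots, T\}$ // Sample a point along the Markov chain
    $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0}, \mathbf{I})$ // Sample a noise vector
    $\mathbf{z}_t \leftarrow \sqrt{\alpha_t}\,\mathbf{x} + \sqrt{1 - \alpha_t}\,\boldsymbol{\epsilon}$ // Evaluate noisy latent variable
    $\mathcal{L}(\mathbf{w}) \leftarrow \|\mathbf{g}(\mathbf{z}_t, \mathbf{w}, t) - \boldsymbol{\epsilon}\|^2$ // Compute loss term
    Take optimizer step
until converged
return $\mathbf{w}$


Once the network has been trained we can generate new samples in the data
space by first sampling from the Gaussian distribution $p(\mathbf{z}_T)$ and then denoising
successively through each step of the Markov chain. Given a denoised sample $\mathbf{z}_t$ at
step $t$, we generate a sample $\mathbf{z}_{t-1}$ in three steps. First we evaluate the output of the
neural network given by $\mathbf{g}(\mathbf{z}_t, \mathbf{w}, t)$. From this we evaluate $\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t)$ using (20.36).
Finally we generate a sample $\mathbf{z}_{t-1}$ from
$p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w}) = \mathcal{N}(\mathbf{z}_{t-1}|\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t), \beta_t \mathbf{I})$
by adding noise scaled by the standard deviation $\sqrt{\beta_t}$ so that

\[ \mathbf{z}_{t-1} = \boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t) + \sqrt{\beta_t}\,\boldsymbol{\epsilon} \tag{20.41} \]

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0}, \mathbf{I})$. Note that the network $\mathbf{g}(\cdot, \cdot, \cdot)$ predicts the total noise added
to the original data vector $\mathbf{x}$ to obtain $\mathbf{z}_t$, but in the sampling step, we subtract off
only a fraction $\beta_t/\sqrt{1 - \alpha_t}$ of this noise from $\mathbf{z}_t$ and then add additional noise
with variance $\beta_t$ to generate $\mathbf{z}_{t-1}$. At the final step when we calculate a synthetic
data sample $\mathbf{x}$, we do not add additional noise since we are aiming to generate a
noise-free output. The sampling procedure is summarized in Algorithm 20.2.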
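The sampling loop can be sketched in a few lines of Python for scalar data. This is only an illustration under stated assumptions: `ddpm_sample` is a hypothetical name, and because no trained network is available here, `gaussian_noise_predictor` supplies the exact expected noise for the special case in which the data are drawn from a one-dimensional Gaussian, standing in for $\mathbf{g}(\mathbf{z}_t, \mathbf{w}, t)$.

```python
import math
import random


def cumulative_alphas(betas):
    """alpha_t = prod_{tau <= t} (1 - beta_tau) for t = 1..T."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


def ddpm_sample(g, betas, rng):
    """Draw one scalar sample by reversing the chain from t = T down to t = 1."""
    alphas = cumulative_alphas(betas)
    z = rng.gauss(0.0, 1.0)                          # z_T ~ N(0, 1)
    for t in range(len(betas), 0, -1):
        beta, alpha = betas[t - 1], alphas[t - 1]
        mu = (z - beta / math.sqrt(1.0 - alpha) * g(z, t)) / math.sqrt(1.0 - beta)
        if t == 1:
            return mu                                # final step: no noise is added
        z = mu + math.sqrt(beta) * rng.gauss(0.0, 1.0)


def gaussian_noise_predictor(mean, std, betas):
    """Exact E[eps | z_t] when the data are N(mean, std^2) (toy stand-in for a network)."""
    alphas = cumulative_alphas(betas)

    def g(z, t):
        a = alphas[t - 1]
        var_t = a * std ** 2 + (1.0 - a)             # marginal variance of z_t
        return math.sqrt(1.0 - a) * (z - math.sqrt(a) * mean) / var_t

    return g
```

With a schedule whose final $\alpha_T$ is close to zero, samples produced this way should have approximately the mean and standard deviation of the toy data distribution, which gives a simple sanity check of the update equations.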

The main drawback of diffusion models for generating data is that they require
multiple sequential inference passes through the trained network, which can be
computationally expensive. One way to speed up the sampling process is first to convert
the denoising process to a differential equation over continuous time and then to
solve that equation using alternative, more efficient discretization methods.

We have assumed in this chapter that the data and latent variables are continuous 
and that we can therefore use Gaussian noise models. Diffusion models can also 
be defined for discrete spaces (Austin et al., 2021), for example, to generate new 
candidate drug molecules in which part of the generation process involves choosing 
atom types from a subset of chemical elements. 

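Algorithm 20.2: Sampling from a denoising diffusion probabilistic model

Input: Trained denoising network $\mathbf{g}(\mathbf{z}, \mathbf{w}, t)$
       Noise schedule $\{\beta_1, \ldots, \beta_T\}$
Output: Sample vector $\mathbf{x}$ in data space

$\mathbf{z}_T \sim \mathcal{N}(\mathbf{z}|\mathbf{0}, \mathbf{I})$ // Sample from final latent space
for $t \in T, \ldots, 2$ do
    $\alpha_t \leftarrow \prod_{\tau=1}^{t} (1 - \beta_\tau)$ // Calculate alpha
    // Evaluate network output
    $\boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t) \leftarrow \frac{1}{\sqrt{1 - \beta_t}} \left\{ \mathbf{z}_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}} \mathbf{g}(\mathbf{z}_t, \mathbf{w}, t) \right\}$
    $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0}, \mathbf{I})$ // Sample a noise vector
    $\mathbf{z}_{t-1} \leftarrow \boldsymbol{\mu}(\mathbf{z}_t, \mathbf{w}, t) + \sqrt{\beta_t}\,\boldsymbol{\epsilon}$ // Add scaled noise
end for
$\mathbf{x} = \frac{1}{\sqrt{1 - \beta_1}} \left\{ \mathbf{z}_1 - \frac{\beta_1}{\sqrt{1 - \alpha_1}} \mathbf{g}(\mathbf{z}_1, \mathbf{w}, 1) \right\}$ // Final denoising step
return $\mathbf{x}$


We have seen that diffusion models can be computationally intensive because
they sequentially reverse a noise process that can have hundreds or thousands of
steps. Song, Meng, and Ermon (2020) introduced a related technique called
denoising diffusion implicit models that relax the Markovian assumption on the noise
process while retaining the same objective function for training. This allows a
speed-up of one or two orders of magnitude during sampling without degrading the
quality of the generated samples.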


20.3 Score Matching


The denoising diffusion models discussed so far in this chapter are closely related to 
another class of deep generative models that were developed relatively independently 
and which are based on score matching (Hyvärinen, 2005; Song and Ermon, 2019). 
These make use of the score function or Stein score, which is defined as the gradient 
of the log likelihood with respect to the data vector x and is given by 


\[ \mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \ln p(\mathbf{x}). \tag{20.42} \]


Here it is important to emphasize that the gradient is with respect to the data vector,
not with respect to any parameter vector. Note that $\mathbf{s}(\mathbf{x})$ is a vector-valued function
of the same dimensionality as $\mathbf{x}$ and that each element $s_i(\mathbf{x}) = \partial \ln p(\mathbf{x})/\partial x_i$ is
associated with a corresponding element $x_i$ of $\mathbf{x}$. For example, if $\mathbf{x}$ is an image then
$\mathbf{s}(\mathbf{x})$ can also be represented as an image of the same dimensions with corresponding
pixels. Figure 20.5 shows an example of a probability density in two dimensions,
along with the corresponding score function.

Figure 20.5 Illustration of the score function, showing a distribution in two dimensions
comprising a mixture of Gaussians represented as a heat map and the corresponding score
function defined by (20.42) plotted as vectors on a regular grid of x-values.

To see why the score function is useful, consider two functions $q(\mathbf{x})$ and $p(\mathbf{x})$
that have the property that their scores are equal, so that $\nabla_{\mathbf{x}} \ln q(\mathbf{x}) = \nabla_{\mathbf{x}} \ln p(\mathbf{x})$
for all values of $\mathbf{x}$. If we integrate both sides of the equation with respect to $\mathbf{x}$ and
take exponentials, we obtain $q(\mathbf{x}) = K p(\mathbf{x})$ where $K$ is a constant independent of
$\mathbf{x}$. So if we are able to learn a model $\mathbf{s}(\mathbf{x}, \mathbf{w})$ of the score function then we have
modelled the original data density, up to a multiplicative constant.
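A small numerical check can make both points concrete. The Python sketch below, which assumes a one-dimensional mixture of Gaussians and uses illustrative function names, evaluates the score analytically and compares it with a finite-difference derivative of the log density. Multiplying the density by any constant $K$ leaves the score unchanged, because $\ln K$ disappears under differentiation.

```python
import math


def mixture_density(x, weights, means, stds):
    """p(x) for a one-dimensional mixture of Gaussians."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def mixture_score(x, weights, means, stds):
    """s(x) = d/dx ln p(x) = p'(x) / p(x), evaluated analytically."""
    dp = sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        * (-(x - m) / s ** 2)
        for w, m, s in zip(weights, means, stds)
    )
    return dp / mixture_density(x, weights, means, stds)


def numerical_score(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)
```

For example, `numerical_score(lambda x: math.log(7.3 * mixture_density(x, w, m, s)), x0)` agrees with `mixture_score(x0, w, m, s)` to within the finite-difference error, even though `7.3 * p(x)` is not a normalized density.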


20.3.1 Score loss function 


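To train such a model we need to define a loss function that aims to match the
model score function $\mathbf{s}(\mathbf{x}, \mathbf{w})$ to the score function $\nabla_{\mathbf{x}} \ln p(\mathbf{x})$ of the distribution
$p(\mathbf{x})$ that generated the data. An example of such a loss function is the expected
squared error between the model score and the true score, given by

\[ J(\mathbf{w}) = \frac{1}{2} \int \| \mathbf{s}(\mathbf{x}, \mathbf{w}) - \nabla_{\mathbf{x}} \ln p(\mathbf{x}) \|^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x}. \tag{20.43} \]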


As we saw in the discussion of energy-based models, the score function does
not require the associated probability density to be normalized, because the
normalization constant is removed by the gradient operator, and so there is considerable
flexibility in the choice of model. There are broadly two ways to represent the score
function $\mathbf{s}(\mathbf{x}, \mathbf{w})$ using a deep neural network. Each element $s_i$ of $\mathbf{s}$ corresponds to
one of the elements $x_i$ of $\mathbf{x}$, so the first approach is to have a network with the same
number of outputs as inputs. However, the score function is defined to be the gradient
of a scalar function (the log probability density), which is a more restricted class of
functions. So an alternative approach is to have a network with a single output $\phi(\mathbf{x})$
and then to compute $\nabla_{\mathbf{x}} \phi(\mathbf{x})$ using automatic differentiation. This second approach,
however, requires two backpropagation steps and is therefore computationally more
expensive. For this reason, most applications simply adopt the first approach.


20.3.2 Modified score loss 


One problem with the loss function (20.43) is that we cannot minimize it directly
because we do not know the true data score $\nabla_{\mathbf{x}} \ln p(\mathbf{x})$. All we have is the finite data
set $\mathcal{D} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$ from which we can construct an empirical distribution:

\[ p_{\mathcal{D}}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \delta(\mathbf{x} - \mathbf{x}_n). \tag{20.44} \]


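Here $\delta(\mathbf{x})$ is the Dirac delta function, which can be thought of informally as an
infinitely tall ‘spike’ at $\mathbf{x} = \mathbf{0}$ with the properties

\[ \delta(\mathbf{x}) = 0, \quad \mathbf{x} \neq \mathbf{0} \tag{20.45} \]
\[ \int \delta(\mathbf{x}) \, \mathrm{d}\mathbf{x} = 1. \tag{20.46} \]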


Since (20.44) is not a differentiable function of $\mathbf{x}$, we cannot compute its score
function. We can address this by introducing a noise model to ‘smear out’ the data points
and give a smooth, differentiable representation of the density. This is known as a
Parzen estimator or kernel density estimator and is defined by

\[ q_\sigma(\mathbf{z}) = \int q(\mathbf{z}|\mathbf{x}, \sigma) \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x} \tag{20.47} \]

where $q(\mathbf{z}|\mathbf{x}, \sigma)$ is the noise kernel. A common choice of kernel is the Gaussian

\[ q(\mathbf{z}|\mathbf{x}, \sigma) = \mathcal{N}(\mathbf{z}|\mathbf{x}, \sigma^2 \mathbf{I}). \tag{20.48} \]
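When the empirical density (20.44) is substituted into (20.47) with the Gaussian kernel, the smoothed density becomes a sum of Gaussians centred on the data points, and its score is available in closed form. The following Python sketch shows this for scalar data. The names `parzen_log_density` and `parzen_score` are chosen for illustration only.

```python
import math


def parzen_log_density(z, data, sigma):
    """ln q_sigma(z) for 1-D data with a Gaussian kernel of standard deviation sigma."""
    log_terms = [-0.5 * ((z - x) / sigma) ** 2 for x in data]
    top = max(log_terms)                      # log-sum-exp for numerical stability
    log_sum = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return log_sum - math.log(len(data) * sigma * math.sqrt(2.0 * math.pi))


def parzen_score(z, data, sigma):
    """d/dz ln q_sigma(z): a responsibility-weighted average of (x_n - z) / sigma^2."""
    log_terms = [-0.5 * ((z - x) / sigma) ** 2 for x in data]
    top = max(log_terms)
    weights = [math.exp(t - top) for t in log_terms]
    total = sum(weights)
    return sum(w * (x - z) for w, x in zip(weights, data)) / (total * sigma ** 2)
```

The score points from `z` towards nearby data points, with each point weighted by how much its kernel contributes to the density at `z`. For a single data point it reduces to `(x - z) / sigma**2`.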


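Instead of minimizing the loss function (20.43), we then use the corresponding
loss with respect to the smoothed Parzen density in the form

\[ J(\mathbf{w}) = \frac{1}{2} \int \| \mathbf{s}(\mathbf{z}, \mathbf{w}) - \nabla_{\mathbf{z}} \ln q_\sigma(\mathbf{z}) \|^2 \, q_\sigma(\mathbf{z}) \, \mathrm{d}\mathbf{z}. \tag{20.49} \]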


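A key result is that by substituting (20.47) into (20.49) we can rewrite this loss
function in an equivalent form given by (Vincent, 2011)

\[ J(\mathbf{w}) = \frac{1}{2} \iint \| \mathbf{s}(\mathbf{z}, \mathbf{w}) - \nabla_{\mathbf{z}} \ln q(\mathbf{z}|\mathbf{x}, \sigma) \|^2 \, q(\mathbf{z}|\mathbf{x}, \sigma) \, p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \, \mathrm{d}\mathbf{x} + \text{const}. \tag{20.50} \]

If we substitute for $p(\mathbf{x})$ using the empirical density (20.44), we obtain

\[ J(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \int \| \mathbf{s}(\mathbf{z}, \mathbf{w}) - \nabla_{\mathbf{z}} \ln q(\mathbf{z}|\mathbf{x}_n, \sigma) \|^2 \, q(\mathbf{z}|\mathbf{x}_n, \sigma) \, \mathrm{d}\mathbf{z} + \text{const}. \tag{20.51} \]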


Figure 20.6 Examples of sampling trajectories obtained using Langevin dynamics defined
by (14.61) for the distribution shown in Figure 20.5, showing three trajectories all
starting at the centre of the plot.


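For the Gaussian Parzen kernel (20.48), the score function becomes

\[ \nabla_{\mathbf{z}} \ln q(\mathbf{z}|\mathbf{x}, \sigma) = -\frac{1}{\sigma} \boldsymbol{\epsilon} \tag{20.52} \]

where $\boldsymbol{\epsilon} = (\mathbf{z} - \mathbf{x})/\sigma$ is drawn from $\mathcal{N}(\boldsymbol{\epsilon}|\mathbf{0}, \mathbf{I})$. If we consider the specific noise model
(20.6) then we obtain

\[ \nabla_{\mathbf{z}} \ln q(\mathbf{z}|\mathbf{x}, \sigma) = -\frac{1}{\sqrt{1 - \alpha_t}} \boldsymbol{\epsilon}. \tag{20.53} \]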

We therefore see that the score loss (20.50) measures the difference between the
neural network prediction and the noise $\boldsymbol{\epsilon}$. Therefore, this loss function has the same
minimum as the form (20.37) used in the denoising diffusion model, with the score
function $\mathbf{s}(\mathbf{z}, \mathbf{w})$ playing the same role as the noise prediction network $\mathbf{g}(\mathbf{z}, \mathbf{w})$ up
to a constant scaling $-1/\sqrt{1 - \alpha_t}$ (Song and Ermon, 2019). Minimizing (20.50) is
known as denoising score matching, and we see the close connection to denoising
diffusion models. There remains the question of how to choose the noise variance
$\sigma^2$, and we will return to this shortly.
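A Monte Carlo estimate of the denoising score matching objective (20.51) with the Gaussian kernel (20.48) can be sketched as follows for scalar data. The function name `dsm_loss` and the callable `score` are assumptions for this example, the additive constant is dropped, and the regression target is written as $-\epsilon/\sigma$ with $\epsilon$ a standard normal draw, so that $z = x + \sigma\epsilon$.

```python
import random


def dsm_loss(score, data, sigma, rng, draws_per_point=100):
    """Monte Carlo estimate of the denoising score matching loss, constant omitted."""
    total, count = 0.0, 0
    for x in data:
        for _ in range(draws_per_point):
            eps = rng.gauss(0.0, 1.0)         # standard normal noise
            z = x + sigma * eps               # z ~ N(x, sigma^2)
            target = -eps / sigma             # score of the Gaussian kernel at z
            total += 0.5 * (score(z) - target) ** 2
            count += 1
    return total / count
```

For a data set containing a single point `x0`, the function `lambda z: -(z - x0) / sigma**2` attains a loss of essentially zero, whereas a score model that always returns zero gives a loss close to `1 / (2 * sigma**2)`.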

Having trained a score-based model we then need to draw new samples. Langevin 
dynamics is well-suited to score-based models because it is based on the score func- 
tion and therefore does not require a normalized probability distribution, and is il- 
lustrated in Figure 20.6. 
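A minimal sketch of Langevin sampling driven only by a score function is given below for a scalar variable. It assumes the standard unadjusted update $z \leftarrow z + \eta\, s(z) + \sqrt{2\eta}\,\epsilon$ with step size $\eta$, which is taken here to be the form that (14.61) refers to. The function name and arguments are illustrative.

```python
import math
import random


def langevin_sample(score, z0, step, n_steps, rng):
    """Run unadjusted Langevin dynamics and return the final state of the chain."""
    z = z0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z + step * score(z) + math.sqrt(2.0 * step) * noise
    return z
```

Only `score(z)` is needed, so any normalization constant of the underlying density never has to be computed. For a Gaussian target with mean 1 and standard deviation 0.5, for instance, one would pass `score=lambda z: -(z - 1.0) / 0.25`.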


20.3.3 Noise variance 


We have seen how to learn the score function from a set of training data and how 
to generate new samples from the learned distribution using Langevin sampling. 
However, we can identify three potential problems with this approach (Song and
Ermon, 2019; Luo, 2022). First, if the data distribution lies on a manifold of lower
dimensionality than the data space, the probability density will be zero at points off
the manifold and here the score function is undefined since $\ln p(\mathbf{x})$ is undefined.
Second, in regions of low data density, the estimate of the score function may be 
inaccurate since the loss function (20.43) is weighted by the density. An inaccurate 
score function can lead to poor trajectories when using Langevin sampling. Third, 
even with an accurate model of the score function, the Langevin procedure may not 
sample correctly if the data distribution comprises a mixture of disjoint distributions. 

All three problems can be addressed by choosing a sufficiently large value for
the noise variance $\sigma^2$ used in the kernel function (20.48), because this smears out the
data distribution. However, too large a variance will introduce a significant distortion
of the original distribution and this itself introduces inaccuracies in the modelling of
the score function. This trade-off can be addressed by considering a sequence of
variance values $\sigma_1^2 < \sigma_2^2 < \ldots < \sigma_L^2$ (Song and Ermon, 2019), in which $\sigma_1^2$ is
sufficiently small that the data distribution is accurately represented whereas $\sigma_L^2$ is
sufficiently large that the aforementioned problems are avoided. The score network
is then modified to take the variance as an additional input $\mathbf{s}(\mathbf{x}, \mathbf{w}, \sigma^2)$ and is trained
by using a loss function that is a weighted sum of the loss functions of the form
(20.51) in which each term represents the error between the associated network and
the corresponding perturbed data set. For a data vector $\mathbf{x}_n$, the loss function then
takes the form

\[ \frac{1}{2} \sum_{i=1}^{L} \lambda(i) \int \| \mathbf{s}(\mathbf{z}, \mathbf{w}, \sigma_i^2) - \nabla_{\mathbf{z}} \ln q(\mathbf{z}|\mathbf{x}_n, \sigma_i) \|^2 \, q(\mathbf{z}|\mathbf{x}_n, \sigma_i) \, \mathrm{d}\mathbf{z} \tag{20.54} \]

where $\lambda(i)$ are weighting coefficients. We see that this training procedure precisely
mirrors that used to train hierarchical denoising networks.

Once trained, samples can be generated by running a few steps of Langevin
sampling from each of the models for $i = L, L - 1, \ldots, 2, 1$ in turn. This technique
is called annealed Langevin dynamics, and is analogous to Algorithm 20.2 used to
sample from denoising diffusion models.
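The annealing procedure can be sketched for a scalar variable as follows. To keep the example self-contained it is assumed that the data distribution is a one-dimensional mixture of Gaussians, for which the score of the noise-perturbed density is known exactly and can stand in for a trained network $\mathbf{s}(\mathbf{x}, \mathbf{w}, \sigma^2)$. The step size rule `step_scale * sigma**2` is likewise an assumption of this sketch, not something prescribed by the text.

```python
import math
import random


def perturbed_mixture_score(z, weights, means, stds, sigma):
    """Score of a 1-D Gaussian mixture after convolution with N(0, sigma^2) noise."""
    variances = [s ** 2 + sigma ** 2 for s in stds]
    logs = [
        math.log(w) - 0.5 * math.log(v) - 0.5 * (z - m) ** 2 / v
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    resp = [math.exp(value - top) for value in logs]   # unnormalized responsibilities
    total = sum(resp)
    return sum(r * (m - z) / v for r, m, v in zip(resp, means, variances)) / total


def annealed_langevin(score, sigmas, z0, steps_per_level, step_scale, rng):
    """Langevin sampling at each noise level in turn, from the largest sigma to the smallest."""
    z = z0
    for sigma in sorted(sigmas, reverse=True):
        step = step_scale * sigma ** 2                 # smaller steps at lower noise
        for _ in range(steps_per_level):
            z += step * score(z, sigma) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return z
```

At the largest noise level the two components overlap heavily, so chains can move freely between them. As the noise is reduced the chains settle into one component or the other, in proportions that roughly reflect the mixing coefficients.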


20.3.4 Stochastic differential equations 


We have seen that it is helpful to use a large number of steps, often several
thousand, when constructing the noise process for a diffusion model. It is therefore
natural to ask what happens if we consider the limit of an infinite number of steps,
much as we did for infinitely deep neural networks when we introduced neural
differential equations. In taking such a limit, we need to ensure that the noise variance
$\beta_t$ at each step becomes smaller in keeping with the step size. This leads to a
formulation of diffusion models for continuous time as stochastic differential equations or
SDEs (Song et al., 2020). Both denoising diffusion probabilistic models and score
matching models can then be viewed as a discretization of a continuous-time SDE.

We can write a general SDE as an infinitesimal update to the vector $\mathbf{z}$ in the form

\[ \mathrm{d}\mathbf{z} = \underbrace{\mathbf{f}(\mathbf{z}, t)\,\mathrm{d}t}_{\text{drift}} + \underbrace{g(t)\,\mathrm{d}\mathbf{v}}_{\text{diffusion}} \tag{20.55} \]

where the drift term is deterministic, as in an ODE, but the diffusion term is
stochastic, for example given by infinitesimal Gaussian steps. Here the parameter $t$ is often
called ‘time’ by analogy with physical systems. The forward noise process (20.3)
for a diffusion model can be written as an SDE of the form (20.55) by taking the
continuous-time limit.
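For the SDE (20.55), there is a corresponding reverse SDE (Song et al., 2020)
given by

\[ \mathrm{d}\mathbf{z} = \left\{ \mathbf{f}(\mathbf{z}, t) - g^2(t) \nabla_{\mathbf{z}} \ln p(\mathbf{z}) \right\} \mathrm{d}t + g(t)\,\mathrm{d}\mathbf{v} \tag{20.56} \]

where we recognize $\nabla_{\mathbf{z}} \ln p(\mathbf{z})$ as the score function. The reverse SDE given by (20.56) is
to be solved in reverse from $t = T$ to $t = 0$.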

To solve an SDE numerically, we need to discretize the time variable. The 
simplest approach is to use fixed, equally spaced time steps, which is known as 
the Euler-Maruyama solver. For the reverse SDE, we then recover a form of the 
Langevin equation. However, more sophisticated solvers can be employed that use 
more flexible forms of discretization (Kloeden and Platen, 2013). 
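A fixed-step Euler-Maruyama integrator for a scalar SDE of the form (20.55) might look like the sketch below. The function name and its arguments are illustrative. For a quick check one can use the continuous-time forward noise process $\mathrm{d}z = -\tfrac{1}{2}\beta(t) z\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}v$ with a constant $\beta(t) = 1$, which is an assumption made purely for the example.

```python
import math
import random


def euler_maruyama(drift, diffusion, z0, t0, t1, n_steps, rng):
    """Integrate dz = drift(z, t) dt + diffusion(t) dv with fixed, equal time steps."""
    dt = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        dv = math.sqrt(abs(dt)) * rng.gauss(0.0, 1.0)   # Wiener increment over |dt|
        z = z + drift(z, t) * dt + diffusion(t) * dv
        t += dt
    return z
```

With `drift = lambda z, t: -0.5 * z` and `diffusion = lambda t: 1.0`, trajectories started at any fixed value forget their starting point over time, and their distribution approaches a zero-mean, unit-variance Gaussian.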

For all diffusion processes governed by an SDE, there exists a corresponding
deterministic process described by an ODE whose trajectories have the same marginal
probability densities $p(\mathbf{z}|t)$ as the SDE (Song et al., 2020). For an SDE of the form
(20.55), the corresponding ODE is given by

\[ \frac{\mathrm{d}\mathbf{z}}{\mathrm{d}t} = \mathbf{f}(\mathbf{z}, t) - \frac{1}{2} g^2(t) \nabla_{\mathbf{z}} \ln p(\mathbf{z}). \tag{20.57} \]

The ODE formulation allows the use of efficient adaptive-step solvers to reduce the
number of function evaluations dramatically. Moreover, it allows probabilistic
diffusion models to be related to normalizing flow models, from which the
change-of-variables formula (18.1) can be used to provide an exact evaluation of the log
likelihood.
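The deterministic character of (20.57) can be illustrated with a toy case in which everything is known in closed form. The sketch below assumes scalar data drawn from a Gaussian with the given `mean` and `std`, a drift $f(z, t) = -\tfrac{1}{2}\beta z$ and a diffusion coefficient $g = \sqrt{\beta}$ with constant $\beta$, so that $p(z|t)$ stays Gaussian and its score is exact. It then integrates the ODE backwards from $t = T$ to $t = 0$ with simple Euler steps. The function name is hypothetical, and a practical implementation would use a learned score and an adaptive solver.

```python
import math


def probability_flow_to_data(z_T, mean, std, beta, T, n_steps):
    """Euler-integrate dz/dt = f - (1/2) g^2 * score backwards from t = T to t = 0."""
    dt = T / n_steps
    z = z_T
    for k in range(n_steps):
        t = T - k * dt
        decay = math.exp(-beta * t)
        mu_t = mean * math.sqrt(decay)                  # mean of p(z|t)
        var_t = std ** 2 * decay + (1.0 - decay)        # variance of p(z|t)
        score = -(z - mu_t) / var_t                     # exact Gaussian score
        dz_dt = -0.5 * beta * z - 0.5 * beta * score
        z -= dz_dt * dt                                 # step backwards in time
    return z
```

No random numbers are drawn during the integration. In this toy setting a latent value lying $k$ standard deviations from the mean of $p(z|T)$ is carried to a data value lying $k$ standard deviations from the data mean, up to discretization error.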


20.4 Guided Diffusion


So far, we have considered diffusion models as a way to represent the unconditional
density $p(\mathbf{x})$ learned from a set of training examples $\mathbf{x}_1, \ldots, \mathbf{x}_N$ drawn
independently from $p(\mathbf{x})$. Once the model has been trained, we can generate new samples
from this distribution. We have already seen an example of unconditional sampling
from a deep generative model for face images in Figure 1.3, in that case from a GAN
model.

In many applications, however, we want to sample from a conditional distribution
$p(\mathbf{x}|\mathbf{c})$ where the conditioning variable $\mathbf{c}$ could, for example, be a class label or
a textual description of the desired content for an image. This also forms the basis
for applications such as image super-resolution, image inpainting, video generation,
and many others. The simplest approach to achieving this would be to treat $\mathbf{c}$ as an
additional input into the denoising neural network $\mathbf{g}(\mathbf{z}, \mathbf{w}, t, \mathbf{c})$ and then to train the
network using matched pairs $\{\mathbf{x}_n, \mathbf{c}_n\}$. The main limitation of this approach is that
the network can give insufficient weight to, or even ignore, the conditioning
variables, so we need a way to control how much weight is given to the conditioning
information and to trade this off against sample diversity. This additional pressure
to match the conditioning information is called guidance. There are two main
approaches to guidance depending on whether or not a separate classifier model is
used.


20.4.1 Classifier guidance 


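Suppose that a trained classifier $p(\mathbf{c}|\mathbf{x})$ is available, and consider a diffusion
model from the perspective of the score function. Using Bayes’ theorem we can
write the score function for the conditional diffusion model in the form

\[ \begin{aligned} \nabla_{\mathbf{x}} \ln p(\mathbf{x}|\mathbf{c}) &= \nabla_{\mathbf{x}} \ln \left\{ \frac{p(\mathbf{c}|\mathbf{x}) \, p(\mathbf{x})}{p(\mathbf{c})} \right\} \\ &= \nabla_{\mathbf{x}} \ln p(\mathbf{x}) + \nabla_{\mathbf{x}} \ln p(\mathbf{c}|\mathbf{x}) \end{aligned} \tag{20.58} \]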


where we have used $\nabla_{\mathbf{x}} \ln p(\mathbf{c}) = 0$ since $p(\mathbf{c})$ is independent of $\mathbf{x}$. The first term
on the right-hand side of (20.58) is the usual unconditional score function, whereas
the second term pushes the denoising process towards the direction that maximizes
the probability of the given label $\mathbf{c}$ under the classifier model (Dhariwal and Nichol,
2021). The influence of the classifier can be controlled by introducing a
hyperparameter $\lambda$, called the guidance scale, which controls the weight given to the classifier
gradient. The score function used for sampling then becomes

\[ \text{score}(\mathbf{x}, \mathbf{c}, \lambda) = \nabla_{\mathbf{x}} \ln p(\mathbf{x}) + \lambda \nabla_{\mathbf{x}} \ln p(\mathbf{c}|\mathbf{x}). \tag{20.59} \]

If $\lambda = 0$ we recover the original unconditional diffusion model, whereas if $\lambda = 1$ we
obtain the score corresponding to the conditional distribution $p(\mathbf{x}|\mathbf{c})$. For $\lambda > 1$ the
model is strongly encouraged to respect the conditioning label, and values of $\lambda \gg 1$
may be used, for example $\lambda = 10$. However, this comes at the expense of diversity in
the samples as the model prefers ‘easy’ examples that the classifier is able to classify
correctly.
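The effect of the guidance scale can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-class model for scalar $x$ in which each class-conditional density is a unit-variance Gaussian, so that the classifier $p(c|x)$ follows exactly from Bayes' theorem. Gradients are taken by finite differences purely for brevity, and all names are illustrative.

```python
import math

MEANS, STD, PRIORS = (-2.0, 2.0), 1.0, (0.5, 0.5)   # hypothetical two-class toy model


def log_joint(x, c):
    """ln p(x, c) = ln p(c) + ln N(x | MEANS[c], STD^2)."""
    return (math.log(PRIORS[c]) - 0.5 * ((x - MEANS[c]) / STD) ** 2
            - math.log(STD * math.sqrt(2.0 * math.pi)))


def log_marginal(x):
    """ln p(x), summing the joint over both classes."""
    return math.log(sum(math.exp(log_joint(x, c)) for c in (0, 1)))


def log_classifier(x, c):
    """ln p(c|x) from Bayes' theorem."""
    return log_joint(x, c) - log_marginal(x)


def grad(f, x, h=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def classifier_guided_score(x, c, lam):
    """Unconditional score plus lam times the classifier gradient, as in (20.59)."""
    return grad(log_marginal, x) + lam * grad(lambda u: log_classifier(u, c), x)
```

Setting `lam=1.0` reproduces the conditional score $-(x - \mu_c)/\sigma^2$ of the chosen class, `lam=0.0` gives the unconditional score, and larger values push more strongly towards regions that the classifier assigns to class `c`.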

One problem with the classifier-based approach to guidance is that a separate 
classifier must be trained. Furthermore, this classifier needs to be able to classify 
examples with varying degrees of noise, whereas standard classifiers are trained on 
clean examples. We therefore turn to an alternative approach that avoids the use of a 
separate classifier. 


20.4.2 Classifier-free guidance 


If we use (20.58) to replace $\nabla_{\mathbf{x}} \ln p(\mathbf{c}|\mathbf{x})$ in (20.59), we can write the score
function in the form

\[ \text{score}(\mathbf{x}, \mathbf{c}, \lambda) = \lambda \nabla_{\mathbf{x}} \ln p(\mathbf{x}|\mathbf{c}) + (1 - \lambda) \nabla_{\mathbf{x}} \ln p(\mathbf{x}), \tag{20.60} \]

which for $0 < \lambda < 1$ represents a convex combination of the gradients of the conditional
log density $\ln p(\mathbf{x}|\mathbf{c})$ and the unconditional log density $\ln p(\mathbf{x})$. For $\lambda > 1$ the contribution from
the unconditional score becomes negative, meaning the model actively reduces the
probability of generating samples that ignore the conditioning information in favour
of samples that do.

Furthermore, we can avoid training separate networks to model p(x|c) and p(x) 
by training a single conditional model in which the conditioning variable c is set to 
a null value, for example c = 0, with some probability during training, typically 
around 10-20%. Then p(x) is represented by p(x|c = 0). This is somewhat anal- 
ogous to dropout in which the conditioning inputs are collectively set to zero for a 
random subset of training vectors. 

Once trained, the score function (20.60) is then used to encourage a strong 
weighting of the conditional information. In practice, classifier-free guidance gives 
much higher quality results than classifier guidance (Nichol et al., 2021; Saharia et 
al., 2022). The reason is that a classifier p(c|x) can ignore most of the input vector 
x as long as it makes a good prediction of c whereas classifier-free guidance is based 
on the conditional density p(x|c), which must assign a high probability to all aspects 
of x. 
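The two ingredients of classifier-free guidance, randomly nulling the conditioning input during training and combining two evaluations of the same model during sampling, can be sketched as follows. The names `NULL`, `drop_condition`, `classifier_free_score`, and the callable `score_model(x, c)` are assumptions made for illustration.

```python
import random

NULL = 0   # hypothetical null value standing for "no conditioning"


def drop_condition(c, p_null, rng):
    """Replace the conditioning value by NULL with probability p_null (training time)."""
    return NULL if rng.random() < p_null else c


def classifier_free_score(score_model, x, c, lam):
    """Guided score (20.60) from one model evaluated with and without conditioning."""
    conditional = score_model(x, c)
    unconditional = score_model(x, NULL)
    return [lam * sc + (1.0 - lam) * su for sc, su in zip(conditional, unconditional)]
```

During training each conditioning value would be passed through `drop_condition` before being fed to the network. At sampling time `lam = 1` returns the plain conditional score, while `lam > 1` extrapolates away from the unconditional score.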

Text-guided diffusion models can leverage techniques from large language mod- 
els to allow the conditioning input to be a general text sequence, known as a prompt, 
and not simply a selection from a predefined set of class labels. This allows the text 
input to influence the denoising process in two ways, first by concatenating the inter- 
nal representation from a transformer-based language model with the input to the de- 
noising network and second by allowing cross-attention layers within the denoising 
network to attend to the text token sequence. Classifier-free guidance, conditioned 
on a text prompt, is illustrated in Figure 20.7. 

Another application for conditional diffusion models is image super-resolution 
in which a low-resolution image is transformed into a corresponding high-resolution 
image. This is intrinsically an inverse problem, and multiple high-resolution im- 
ages will be consistent with a given low-resolution image. Super-resolution can 
be achieved by denoising a high-resolution sample from a Gaussian using the low- 
resolution image as a conditioning variable (Saharia, Ho, et al., 2021). Examples of 
this method are shown in Figure 20.8. Such models can be cascaded to achieve very
high resolution (Ho et al., 2021), for example going from 64 × 64 to 256 × 256,
and then from 256 × 256 to 1024 × 1024. Each stage is typically represented by a
U-net architecture, with each U-net conditioned on the final denoised output of the
previous one.

This type of cascade can also be used with image-generation diffusion models, 
in which the image denoising is performed at a lower resolution and the result is sub- 
sequently up-sampled using a separate network (which may also take a text prompt 
as input) to give a final high-resolution output (Nichol et al., 2021; Saharia et al., 
2022). This can significantly reduce the computational cost compared to working 
directly in a high-dimensional space since the denoising process may involve hun- 
dreds of passes through the denoising network. Note that these approaches still work 
within the image space directly but at lower resolution. 

A different approach to addressing the high computational cost of applying
diffusion models directly in the space of high-resolution images is called latent
diffusion models (Rombach et al., 2021). Here an autoencoder (Section 19.1) is first
trained on noise-free images to obtain a lower-dimensional representation of the images
and is then fixed. A U-net architecture is then trained to perform the denoising within the
lower-dimensional space, which itself is not directly interpretable as an image. Finally,
the denoised representation is mapped into the high-resolution image space using
the output half of the fixed autoencoder network. This approach makes more
efficient use of the low-dimensional space, which can then focus on image semantics,
leaving the decoder to create a corresponding sharp, high-resolution image from the
denoised low-dimensional representation.

Figure 20.7 Illustration of classifier-free guidance of diffusion models, generated from a model called GLIDE
using the conditioning text ‘A stained glass window of a panda eating bamboo’. Examples on the left were
generated with $\lambda = 0$ (no guidance, just the plain conditional model) whereas examples on the right were
generated with $\lambda = 3$. [From Nichol et al. (2021) with permission.]

There are many other applications of conditional image generation including
inpainting, un-cropping, restoration, image morphing, style transfer, colourization,
de-blurring, and video generation (Yang, Srivastava, and Mandt, 2022). An example
of inpainting is shown in Figure 20.9.

Figure 20.8 Two examples of low-resolution images along with associated samples of
corresponding high-resolution images generated by a diffusion model. The top row shows
a 16 × 16 input image and the corresponding 128 × 128 output image along with the
original image from which the input image was generated. The bottom row shows a
64 × 64 input image with a 256 × 256 output image, again with the original image for
comparison. [From Saharia, Ho, et al. (2021) with permission.]

Figure 20.9 Example of inpainting showing the original image on the left, an image with
sections removed in the middle, and the image with inpainting on the right. [From
Saharia, Chan, Chang, et al. (2021) with permission.]


Exercises

20.1 (⋆) Using (20.3) write down expressions for the mean and covariance of $\mathbf{z}_t$ in terms
of the mean and covariance of $\mathbf{z}_{t-1}$. Hence, show that for $0 < \beta_t < 1$ the mean of
the distribution of $\mathbf{z}_t$ is closer to zero than the mean of $\mathbf{z}_{t-1}$ and that the covariance
of $\mathbf{z}_t$ is closer to the unit matrix $\mathbf{I}$ than the covariance of $\mathbf{z}_{t-1}$.


20.2 (⋆) Show that the transformation (20.1) can be written in the equivalent form (20.2).


20.3 (⋆⋆⋆) In this exercise we use proof by induction to show that the marginal
distribution of $\mathbf{z}_t$ for the forward process of the diffusion model, as defined by (20.4), is
given by (20.6) where $\alpha_t$ is defined by (20.7). First verify that (20.6) holds when
$t = 1$. Now assume that (20.6) is true for some particular value of $t$ and derive the
corresponding result for the value $t + 1$. To do this, it is easiest to write the
forward process using the representation (20.3) and to make use of the result (3.212),
which shows that the sum of two independent Gaussian random variables is itself a
Gaussian in which the means and covariances are additive.


20.4 (⋆) By using the result (20.6), where $\alpha_t$ is defined by (20.7), show that in the limit
$T \to \infty$ we obtain (20.9).


20.5 (⋆⋆) Consider two independent random variables $\mathbf{a}$ and $\mathbf{b}$ along with a fixed scalar
$\lambda$. Show that

\[ \text{cov}[\mathbf{a} + \mathbf{b}] = \text{cov}[\mathbf{a}] + \text{cov}[\mathbf{b}] \tag{20.61} \]
\[ \text{cov}[\lambda \mathbf{a}] = \lambda^2 \, \text{cov}[\mathbf{a}]. \tag{20.62} \]

Use these results to show that if the distribution of $\mathbf{z}_{t-1}$ has zero mean and unit
covariance, then the distribution of $\mathbf{z}_t$, defined by (20.3), will also have zero mean
and unit covariance, irrespective of the value of $\beta_t$.


20.6 (⋆⋆⋆) In this exercise we will use the technique of completing the square to derive
the result (20.15) starting from Bayes’ theorem (20.13). First note that the two terms
in the numerator on the right-hand side of (20.13), given by (20.4) and (20.6), both
take the form of exponentials of quadratic functions of $\mathbf{z}_{t-1}$. The required
distribution is therefore a Gaussian, and so we need only to find its mean and covariance. To
do this, consider only the terms in the exponentials that depend on $\mathbf{z}_{t-1}$ and note that
the product of two exponentials is the exponential of the sum of the two exponents.
Gather together all the terms that are quadratic in $\mathbf{z}_{t-1}$ as well as those that are linear
in $\mathbf{z}_{t-1}$ and then rearrange them in the form
$(\mathbf{z}_{t-1} - \mathbf{m}_t)^{\mathrm{T}} \mathbf{S}_t^{-1} (\mathbf{z}_{t-1} - \mathbf{m}_t)$. Then,
by inspection, find expressions for $\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t)$ and $\mathbf{S}_t$. Note that additive terms that
are independent of $\mathbf{z}_{t-1}$ can be ignored.


20.7 (⋆⋆⋆) In this exercise we show that the reverse of the conditional distribution $q(\mathbf{z}_t|\mathbf{z}_{t-1})$
for the forward noise process in a diffusion model can be approximated by a
Gaussian when the noise variance is small. Consider the inverse conditional distribution
$q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ given by Bayes’ theorem in the form (20.11) where the forward
distribution $q(\mathbf{z}_t|\mathbf{z}_{t-1})$ is given by (20.4). By taking the logarithm of both sides of (20.11)
and then making a Taylor expansion of $q(\mathbf{z}_{t-1})$ centred on the value $\mathbf{z}_t$, show that,
for small values of the noise variance $\beta_t$, the distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t)$ is approximately
a Gaussian with mean $\mathbf{z}_t$ and covariance $\beta_t \mathbf{I}$. Find expressions for the lowest-order
corrections to the mean and to the covariance as expansions in powers of $\beta_t$.


20.8 (⋆⋆) By substituting the product rule of probability in the form (20.24) into the
definition (20.22) of the ELBO for the diffusion model and making use of the definition
(20.23) of the Kullback–Leibler divergence, verify that the log likelihood function
can be written as the sum of a lower bound and a Kullback–Leibler divergence in the
form (20.21).


20.9 (⋆⋆) Verify that the ELBO for the diffusion model given by (20.31) can be written
in the form (20.32) where the Kullback–Leibler divergence is defined by (20.23).


20.10 (⋆⋆) When we derived the ELBO for the diffusion model given by (20.32), we
omitted the first and third terms in (20.26) because they are independent of $\mathbf{w}$. Similarly
we omitted the second term in the right-hand side of (20.30) because this is also
independent of $\mathbf{w}$. Show that if all of these omitted terms are retained they lead to an
additional term in the ELBO $\mathcal{L}(\mathbf{x})$ given by

\[ \mathrm{KL}\left( q(\mathbf{z}_T|\mathbf{x}) \,\|\, p(\mathbf{z}_T) \right). \tag{20.63} \]

Note that the noise process is constructed in such a way that the distribution $q(\mathbf{z}_T|\mathbf{x})$
is equal to the Gaussian $\mathcal{N}(\mathbf{z}_T|\mathbf{0}, \mathbf{I})$. Similarly, the distribution $p(\mathbf{z}_T)$ is defined to be
equal to $\mathcal{N}(\mathbf{z}_T|\mathbf{0}, \mathbf{I})$, and hence the two distributions in (20.63) are equal and so the
Kullback–Leibler divergence vanishes.


20.11 (⋆⋆) By making use of (20.15) for the distribution $q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})$ and (20.18) for the
distribution $p(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{w})$, show that the Kullback–Leibler divergence appearing in
the consistency terms in (20.32) is given by (20.33).


20.12 (⋆⋆) By substituting (20.34) into (20.16) rewrite the mean $\mathbf{m}_t(\mathbf{x}, \mathbf{z}_t)$ in terms of the
original data vector $\mathbf{x}$ and the noise $\boldsymbol{\epsilon}$ in the form (20.35), where $\alpha_t$ is defined by
(20.7).


20.13 (⋆⋆) Show that the reconstruction term (20.38) in the ELBO for diffusion models can
be written in the form (20.39). To do this, substitute for $\boldsymbol{\mu}(\mathbf{z}_1, \mathbf{w}, 1)$ using (20.36)
and substitute for $\mathbf{x}$ using (20.1), and then make use of $\alpha_1 = (1 - \beta_1)$, which follows
from (20.7).


20.14 (⋆) The score function is defined by $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \ln p(\mathbf{x}|\mathbf{w})$ and is therefore a vector of
the same dimensionality as the input vector $\mathbf{x}$. Consider a matrix whose elements
are given by

\[ M_{ij} = \frac{\partial s_i}{\partial x_j} - \frac{\partial s_j}{\partial x_i}. \tag{20.64} \]

Show that if the score function is defined by taking the gradient $\mathbf{s} = \nabla_{\mathbf{x}} \phi(\mathbf{x})$ of
the output of a neural network with a single output variable $\phi(\mathbf{x})$, then all the matrix
elements $M_{ij} = 0$ for all pairs $i, j$. Note that if the score function $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \ln p(\mathbf{x}|\mathbf{w})$
is instead represented directly by a deep neural network with the same number of
outputs as inputs, then only the diagonal matrix elements $M_{ii} = 0$, and so the output
of the network does not in general correspond to the gradient of any scalar function.


20.15 (⋆⋆) Consider a deep neural network representation $\mathbf{s}(\mathbf{x}, \mathbf{w})$ for the score function
defined by (20.42), where $\mathbf{x}$ and $\mathbf{s}$ have dimensionality $D$. Compare the
computational complexity of evaluating the score for a network with $D$ outputs that represents
the score function directly with one that computes a single scalar function $\phi(\mathbf{x}, \mathbf{w})$
in which the score function is computed indirectly through automatic differentiation.
Show that the latter approach is typically more computationally expensive.


20.16 (⋆⋆⋆) We cannot minimize the loss function (20.43) directly because we do not
know the functional form of the true data density $p(\mathbf{x})$, and therefore we cannot
write down an expression for the score function $\nabla_{\mathbf{x}} \ln p(\mathbf{x})$. However, by using
integration by parts (Hyvärinen, 2005), we can rewrite (20.43) in the form

\[ J(\mathbf{w}) = \int \left\{ \nabla \cdot \mathbf{s}(\mathbf{x}, \mathbf{w}) + \frac{1}{2} \| \mathbf{s}(\mathbf{x}, \mathbf{w}) \|^2 \right\} p(\mathbf{x}) \, \mathrm{d}\mathbf{x} + \text{const} \tag{20.65} \]

where the constant term is independent of the network parameters $\mathbf{w}$, and the
divergence $\nabla \cdot \mathbf{s}(\mathbf{x}, \mathbf{w})$ is defined by

\[ \nabla \cdot \mathbf{s} = \sum_{i=1}^{D} \frac{\partial s_i}{\partial x_i} = \sum_{i=1}^{D} \frac{\partial^2 \ln p(\mathbf{x})}{\partial x_i^2} \tag{20.66} \]

in which $D$ is the dimensionality of $\mathbf{x}$. Derive the result (20.65) by first expanding the
square in (20.43) and noting that the term involving $\| \mathbf{s}(\mathbf{x}, \mathbf{w}) \|^2$ already appears in
(20.65) whereas the term involving $\| \mathbf{s}_{\mathcal{D}} \|^2$ can be absorbed into the additive constant,
where we have defined $\mathbf{s}_{\mathcal{D}} = \nabla_{\mathbf{x}} \ln p_{\mathcal{D}}(\mathbf{x})$. Now consider the formula

\[ \frac{\mathrm{d}}{\mathrm{d}x} \left( p(x) g(x) \right) = \frac{\mathrm{d}p(x)}{\mathrm{d}x} g(x) + p(x) \frac{\mathrm{d}g(x)}{\mathrm{d}x} \tag{20.67} \]

for the derivative of the product of two functions. Integrate both sides of this formula
with respect to $x$ and rearrange to obtain the integration-by-parts formula:

\[ \int_{-\infty}^{\infty} \frac{\mathrm{d}p(x)}{\mathrm{d}x} g(x) \, \mathrm{d}x = -\int_{-\infty}^{\infty} \frac{\mathrm{d}g(x)}{\mathrm{d}x} p(x) \, \mathrm{d}x \tag{20.68} \]

where we have assumed that $p(\infty) = p(-\infty) = 0$. Apply this result together
with the definition $\mathbf{s}_{\mathcal{D}} = \nabla_{\mathbf{x}} \ln p_{\mathcal{D}}(\mathbf{x})$ to the term involving $\mathbf{s}(\mathbf{x}, \mathbf{w})^{\mathrm{T}} \mathbf{s}_{\mathcal{D}}$ to complete
the proof. Note that the evaluation of the second derivatives in (20.66) requires a
separate backward propagation pass for each derivative and hence has an overall
computational cost that grows quadratically with the dimensionality $D$ of the data
space (Martens, Sutskever, and Swersky, 2012). This precludes the direct application
of this loss function to spaces of high dimensionality, and so techniques such as
sliced score matching (Song et al., 2019) have been developed to help address this
inefficiency.


20.18 (⋆⋆) In this exercise we show that the score function loss (20.50) is equivalent, up
to an additive constant, to the form (20.49). To do this, first expand the square in
(20.49) and by using (20.47) show that the term in $\mathbf{s}^{\mathrm{T}}\mathbf{s}$ from (20.49) is the same as the
corresponding term obtained by expanding the square in (20.50). Next note that the
term in $\|\nabla_{\mathbf{z}} \ln q\|^2$ in (20.49) is independent of w and likewise that the corresponding
term in (20.50) is also independent of w, and so these can be viewed as additive
constants in the loss function and play no role in training. Finally, consider the
cross-term in (20.49). By substituting for $q(\mathbf{z})$ using (20.47), show that this is equal to the
corresponding cross-term from (20.50). Hence, show that the two loss functions are
equal up to an additive constant.


(⋆) Consider a probability distribution that consists of a mixture of two disjoint
distributions (i.e., distributions with the property that when one of them is non-zero the
other must be zero) of the form

$$p(\mathbf{x}) = \lambda p_A(\mathbf{x}) + (1 - \lambda) p_B(\mathbf{x}). \tag{20.69}$$

Show that when the score function, defined by (20.42), is evaluated for any given
point x, the mixing coefficient λ does not appear. From this it follows that Langevin
dynamics defined by (14.61) will not sample from the two component distributions
with the correct proportions. This problem is resolved by adding noise from a broad
distribution, as discussed in the text.
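
A small numerical experiment makes the statement of this exercise concrete. The sketch below is illustrative only: the two triangular component densities with disjoint supports are hypothetical choices, and the score is estimated by a central finite difference of ln p(x) rather than derived analytically.

```python
import math


def p_a(x):
    """Triangular density supported on (-3, -1); a hypothetical component A."""
    return max(0.0, 1.0 - abs(x + 2.0))


def p_b(x):
    """Triangular density supported on (1, 3); a hypothetical component B."""
    return max(0.0, 1.0 - abs(x - 2.0))


def score(x, lam, h=1e-6):
    """Central-difference estimate of d/dx ln p(x) for the mixture (20.69)."""
    def log_p(t):
        return math.log(lam * p_a(t) + (1.0 - lam) * p_b(t))
    return (log_p(x + h) - log_p(x - h)) / (2.0 * h)


# Evaluate the score inside each component for several mixing coefficients.
scores_in_a = [score(-2.5, lam) for lam in (0.1, 0.5, 0.9)]
scores_in_b = [score(2.5, lam) for lam in (0.1, 0.5, 0.9)]
print(scores_in_a, scores_in_b)
```

Within each component the estimated score is the same for every mixing coefficient, because ln λ (or ln(1 − λ)) enters ln p(x) only as an additive constant that vanishes on differentiation.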


(⋆⋆) For discrete steps, the forward noise process in a diffusion model is defined by
(20.3). Here we take the continuous-time limit and convert this to an SDE. We first
introduce a continuously changing variance function β(t) such that $\beta_t = \beta(t) \Delta t$.
By making a Taylor expansion of the square root in the first term on the right-hand
side of (20.3), show that the infinitesimal update can be written in the form

$$\mathrm{d}\mathbf{z} = -\frac{1}{2} \beta(t) \, \mathbf{z} \, \mathrm{d}t + \sqrt{\beta(t)} \, \mathrm{d}\mathbf{v}. \tag{20.70}$$

We see that this is a special case of the general SDE (20.55).
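
The Taylor expansion at the heart of this exercise can be sanity-checked numerically. The sketch below assumes, as is not restated in this section, that the first term of (20.3) scales the previous state by sqrt(1 − β_t); the value of β and the step sizes are arbitrary illustrative choices.

```python
import math


def discrete_drift_coeff(beta, dt):
    """Change in the multiplier of z over one discrete step, with beta_t = beta * dt."""
    return math.sqrt(1.0 - beta * dt) - 1.0


def sde_drift_coeff(beta, dt):
    """Multiplier of z in the drift term of (20.70) over an interval dt."""
    return -0.5 * beta * dt


beta = 2.0
step_sizes = (1e-1, 1e-2, 1e-3)
errors = [abs(discrete_drift_coeff(beta, dt) - sde_drift_coeff(beta, dt))
          for dt in step_sizes]
print(errors)
```

The discrepancy shrinks roughly quadratically with the step size, which is what a first-order Taylor expansion predicts and why only the term linear in dt survives in the continuous-time limit.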


(⋆) By using (20.58) to replace $\nabla_{\mathbf{x}} \ln p(\mathbf{c}|\mathbf{x})$, show that the score function in (20.59)
can be written in the form (20.60).


Appendix A. Linear Algebra 


In this appendix, we gather together some useful properties and identities involving 
matrices and determinants. This is not intended to be an introductory tutorial, and 
it is assumed that the reader is already familiar with basic linear algebra. For some 
results, we indicate how to prove them, whereas in more complex cases we leave 
the interested reader to refer to standard textbooks on the subject. In all cases, we 
assume that inverses exist and that matrix dimensions are such that the formulae 
are correctly defined. A comprehensive discussion of linear algebra can be found in 
Golub and Van Loan (1996), and an extensive collection of matrix properties is given 
by Lütkepohl (1996). Matrix derivatives are discussed in Magnus and Neudecker
(1999). 


Matrix Identities 


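A matrix A has elements $A_{ij}$ where i indexes the rows and j indexes the columns.
We use $\mathbf{I}_N$ to denote the N × N identity matrix (also called the unit matrix), and
if there is no ambiguity over dimensionality, we simply use I. The transpose matrix
$\mathbf{A}^{\mathrm{T}}$ has elements $(\mathbf{A}^{\mathrm{T}})_{ij} = A_{ji}$. From the definition of a transpose, we have

$$(\mathbf{A}\mathbf{B})^{\mathrm{T}} = \mathbf{B}^{\mathrm{T}} \mathbf{A}^{\mathrm{T}}, \tag{A.1}$$

which can be verified by writing out the indices. The inverse of A, denoted $\mathbf{A}^{-1}$,
satisfies

$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}. \tag{A.2}$$

Because $\mathbf{A}\mathbf{B}\mathbf{B}^{-1}\mathbf{A}^{-1} = \mathbf{I}$, we have

$$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{A.3}$$

Also we have

$$(\mathbf{A}^{\mathrm{T}})^{-1} = (\mathbf{A}^{-1})^{\mathrm{T}}, \tag{A.4}$$

which is easily proven by taking the transpose of (A.2) and applying (A.1).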
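A useful identity involving matrix inverses is the following:

$$(\mathbf{P}^{-1} + \mathbf{B}^{\mathrm{T}} \mathbf{R}^{-1} \mathbf{B})^{-1} \mathbf{B}^{\mathrm{T}} \mathbf{R}^{-1} = \mathbf{P} \mathbf{B}^{\mathrm{T}} (\mathbf{B} \mathbf{P} \mathbf{B}^{\mathrm{T}} + \mathbf{R})^{-1}, \tag{A.5}$$

which is easily verified by right-multiplying both sides by $(\mathbf{B}\mathbf{P}\mathbf{B}^{\mathrm{T}} + \mathbf{R})$. Suppose
that P has dimensionality N × N and that R has dimensionality M × M, so that B
is M × N. Then if M ≪ N, it will be much cheaper to evaluate the right-hand side
of (A.5) than the left-hand side. A special case that sometimes arises is

$$(\mathbf{I} + \mathbf{A}\mathbf{B})^{-1} \mathbf{A} = \mathbf{A} (\mathbf{I} + \mathbf{B}\mathbf{A})^{-1}. \tag{A.6}$$

Another useful identity involving inverses is the following:

$$(\mathbf{A} + \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B} (\mathbf{D} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B})^{-1} \mathbf{C}\mathbf{A}^{-1}, \tag{A.7}$$

which is known as the Woodbury identity. It can be verified by multiplying both sides
by $(\mathbf{A} + \mathbf{B}\mathbf{D}^{-1}\mathbf{C})$. This is useful, for instance, when A is large and diagonal and
hence easy to invert, and when B has many rows but few columns (and conversely
for C), so that the right-hand side is much cheaper to evaluate than the left-hand
side.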
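
The Woodbury identity (A.7) can be verified on a small concrete case. The sketch below is an illustration under stated assumptions: the matrices are hypothetical, chosen so that A is diagonal, B has a single column, and C has a single row, and exact rational arithmetic from the standard library is used so that the two sides can be compared without rounding error.

```python
from fractions import Fraction


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y, sign=1):
    """Elementwise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def inverse(X):
    """Matrix inverse by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(X)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(X)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical test matrices: A is diagonal, B is 4 x 1, C is 1 x 4, D is 1 x 1.
A = [[2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 4, 0], [0, 0, 0, 5]]
B = [[1], [2], [0], [1]]
C = [[1, 0, 1, 2]]
D = [[2]]

# Left-hand side of (A.7): invert the full 4 x 4 matrix A + B D^{-1} C.
lhs = inverse(add(A, matmul(matmul(B, inverse(D)), C)))

# Right-hand side of (A.7): only the diagonal A and a 1 x 1 matrix are inverted.
A_inv = inverse(A)
small = inverse(add(D, matmul(matmul(C, A_inv), B)))
rhs = add(A_inv, matmul(matmul(matmul(A_inv, B), small), matmul(C, A_inv)), sign=-1)

print(lhs == rhs)
```

The right-hand side never inverts a dense 4 × 4 matrix, which is the computational saving described above; in a practical setting the diagonal inverse would be formed elementwise rather than by elimination.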

A set of vectors $\{\mathbf{a}_1, \ldots, \mathbf{a}_N\}$ is said to be linearly independent if the relation
$\sum_n \alpha_n \mathbf{a}_n = 0$ holds only if all $\alpha_n = 0$. This implies that none of the vectors
can be expressed as a linear combination of the remainder. The rank of a matrix is
the maximum number of linearly independent rows (or equivalently the maximum
number of linearly independent columns).


Traces and Determinants 


Square matrices have traces and determinants. The trace Tr(A) of a matrix A is 
defined as the sum of the elements on the leading diagonal. By writing out the 
indices, we see that 

Tr(AB) = Tr(BA). (A.8) 


By applying this formula multiple times to the product of three matrices, we see that 
Tr(ABC) = Tr(CAB) = Tr(BCA), (A.9) 


which is known as the cyclic property of the trace operator. It clearly extends to the 
product of any number of matrices. The determinant |A| of an N × N matrix A is
defined by

$$|\mathbf{A}| = \sum (\pm 1) A_{1 i_1} A_{2 i_2} \cdots A_{N i_N} \tag{A.10}$$

in which the sum is taken over all products consisting of precisely one element from
each row and one element from each column, with a coefficient +1 or −1 according
to whether the permutation $i_1 i_2 \ldots i_N$ is even or odd, respectively. Note that |I| = 1,
and that the determinant of a diagonal matrix is given by the product of the elements
on the leading diagonal. Thus, for a 2 × 2 matrix, the determinant takes the form

$$|\mathbf{A}| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11} a_{22} - a_{12} a_{21}. \tag{A.11}$$

The determinant of a product of two matrices is given by

$$|\mathbf{A}\mathbf{B}| = |\mathbf{A}| |\mathbf{B}| \tag{A.12}$$

as can be shown from (A.10). Also, the determinant of an inverse matrix is given by

$$|\mathbf{A}^{-1}| = \frac{1}{|\mathbf{A}|}, \tag{A.13}$$

which can be shown by taking the determinant of (A.2) and applying (A.12).
If A and B are matrices of size N × M, then

$$|\mathbf{I}_N + \mathbf{A}\mathbf{B}^{\mathrm{T}}| = |\mathbf{I}_M + \mathbf{A}^{\mathrm{T}}\mathbf{B}|. \tag{A.14}$$

A useful special case is

$$|\mathbf{I}_N + \mathbf{a}\mathbf{b}^{\mathrm{T}}| = 1 + \mathbf{a}^{\mathrm{T}}\mathbf{b} \tag{A.15}$$

where a and b are N-dimensional column vectors.
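
The permutation definition (A.10) translates directly into code, which gives a simple way to spot-check (A.12) and (A.15). The sketch below is illustrative: the integer matrices and vectors are hypothetical, and the factorial cost of summing over all permutations makes this approach suitable only for very small matrices.

```python
from itertools import permutations


def det(A):
    """Determinant from the permutation expansion (A.10)."""
    n = len(A)
    total = 0
    for perm in permutations(range(n)):
        # The parity of a permutation equals the parity of its inversion count.
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if perm[i] > perm[j])
        term = -1 if inversions % 2 else 1
        for row, col in enumerate(perm):
            term *= A[row][col]
        total += term
    return total


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
B = [[1, 2, 0], [0, 1, 3], [1, 0, 1]]
product_rule_holds = det(matmul(A, B)) == det(A) * det(B)  # identity (A.12)

a = [1, 2, 3]
b = [4, 0, -1]
identity_plus_outer = [[(1 if i == j else 0) + a[i] * b[j] for j in range(3)]
                       for i in range(3)]
inner = sum(x * y for x, y in zip(a, b))
rank_one_rule_holds = det(identity_plus_outer) == 1 + inner  # identity (A.15)

print(product_rule_holds, rank_one_rule_holds)
```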


Matrix Derivatives 


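Sometimes we need to consider derivatives of vectors and matrices with respect to
scalars. The derivative of a vector a with respect to a scalar x is a vector whose
components are given by

$$\left( \frac{\partial \mathbf{a}}{\partial x} \right)_i = \frac{\partial a_i}{\partial x} \tag{A.16}$$

with an analogous definition for the derivative of a matrix. Derivatives with respect
to vectors and matrices can also be defined, for instance

$$\left( \frac{\partial x}{\partial \mathbf{a}} \right)_i = \frac{\partial x}{\partial a_i} \tag{A.17}$$

and similarly

$$\left( \frac{\partial \mathbf{a}}{\partial \mathbf{b}} \right)_{ij} = \frac{\partial a_i}{\partial b_j}. \tag{A.18}$$

The following is easily proven by writing out the components:

$$\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^{\mathrm{T}} \mathbf{a} \right) = \frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^{\mathrm{T}} \mathbf{x} \right) = \mathbf{a}. \tag{A.19}$$

Similarly

$$\frac{\partial}{\partial x} \left( \mathbf{A}\mathbf{B} \right) = \frac{\partial \mathbf{A}}{\partial x} \mathbf{B} + \mathbf{A} \frac{\partial \mathbf{B}}{\partial x}. \tag{A.20}$$

The derivative of the inverse of a matrix can be expressed as

$$\frac{\partial}{\partial x} \left( \mathbf{A}^{-1} \right) = -\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial x} \mathbf{A}^{-1} \tag{A.21}$$

as can be shown by differentiating the equation $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ using (A.20) and then
right-multiplying by $\mathbf{A}^{-1}$. Also

$$\frac{\partial}{\partial x} \ln |\mathbf{A}| = \mathrm{Tr} \left( \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial x} \right) \tag{A.22}$$

which we shall prove later. If we choose x to be one of the elements of A, we have

$$\frac{\partial}{\partial A_{ij}} \mathrm{Tr}(\mathbf{A}\mathbf{B}) = B_{ji} \tag{A.23}$$

as can be seen by writing out the matrices using index notation. We can write this
result more compactly in the form

$$\frac{\partial}{\partial \mathbf{A}} \mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^{\mathrm{T}}. \tag{A.24}$$

With this notation, we have the following properties:

$$\frac{\partial}{\partial \mathbf{A}} \mathrm{Tr}(\mathbf{A}^{\mathrm{T}}\mathbf{B}) = \mathbf{B}, \tag{A.25}$$

$$\frac{\partial}{\partial \mathbf{A}} \mathrm{Tr}(\mathbf{A}) = \mathbf{I}, \tag{A.26}$$

$$\frac{\partial}{\partial \mathbf{A}} \mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^{\mathrm{T}}) = \mathbf{A}(\mathbf{B} + \mathbf{B}^{\mathrm{T}}), \tag{A.27}$$

which can again be proven by writing out the matrix indices. We also have

$$\frac{\partial}{\partial \mathbf{A}} \ln |\mathbf{A}| = \left( \mathbf{A}^{-1} \right)^{\mathrm{T}} \tag{A.28}$$

which follows from (A.22) and (A.24).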
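
Matrix-derivative identities such as (A.24) and (A.28) are easy to get wrong by a transpose, so a finite-difference check is a useful habit. The sketch below is illustrative only: it is restricted to 2 × 2 matrices, the matrices A and B are hypothetical, and the derivatives are approximated by central differences rather than computed exactly.

```python
import math


def det2(M):
    """Determinant of a 2 x 2 matrix, as in (A.11)."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inv2(M):
    """Inverse of a 2 x 2 matrix from the adjugate formula."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]


def transpose(M):
    return [[M[j][i] for j in range(2)] for i in range(2)]


def trace_of_product(X, Y):
    """Tr(XY) for 2 x 2 matrices."""
    return sum(X[i][k] * Y[k][i] for i in range(2) for k in range(2))


def numerical_gradient(f, M, h=1e-6):
    """Matrix whose (i, j) element is a central-difference estimate of df/dM_ij."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            plus = [row[:] for row in M]
            minus = [row[:] for row in M]
            plus[i][j] += h
            minus[i][j] -= h
            grad[i][j] = (f(plus) - f(minus)) / (2.0 * h)
    return grad


A = [[3.0, 1.0], [2.0, 4.0]]   # hypothetical matrix with positive determinant
B = [[1.0, 5.0], [-2.0, 0.5]]  # hypothetical second matrix

grad_trace = numerical_gradient(lambda M: trace_of_product(M, B), A)
grad_logdet = numerical_gradient(lambda M: math.log(det2(M)), A)

print(grad_trace, transpose(B))          # compare with (A.24)
print(grad_logdet, transpose(inv2(A)))   # compare with (A.28)
```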


Eigenvectors 
For a square matrix A of size M × M, the eigenvector equation is defined by

$$\mathbf{A}\mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{A.29}$$

for i = 1, ..., M, where $\mathbf{u}_i$ is an eigenvector and $\lambda_i$ is the corresponding eigenvalue.
This can be viewed as a set of M simultaneous homogeneous linear equations, and
the condition for a solution is that

$$|\mathbf{A} - \lambda_i \mathbf{I}| = 0, \tag{A.30}$$

which is known as the characteristic equation. Because this is a polynomial of order
M in $\lambda_i$, it must have M solutions (though these need not all be distinct). The rank
of A is equal to the number of non-zero eigenvalues.

Of particular interest are symmetric matrices, which arise as covariance matrices,
kernel matrices, and Hessians. Symmetric matrices have the property that
$A_{ij} = A_{ji}$, or equivalently $\mathbf{A}^{\mathrm{T}} = \mathbf{A}$. The inverse of a symmetric matrix is also
symmetric, as can be seen by taking the transpose of $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ and using $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$
together with the symmetry of I.

In general, the eigenvalues of a matrix are complex numbers, but for symmetric
matrices, the eigenvalues $\lambda_i$ are real. This can be seen by first left-multiplying (A.29)
by $(\mathbf{u}_i^{\star})^{\mathrm{T}}$, where ⋆ denotes the complex conjugate, to give

$$(\mathbf{u}_i^{\star})^{\mathrm{T}} \mathbf{A} \mathbf{u}_i = \lambda_i (\mathbf{u}_i^{\star})^{\mathrm{T}} \mathbf{u}_i. \tag{A.31}$$

Next we take the complex conjugate of (A.29) and left-multiply by $\mathbf{u}_i^{\mathrm{T}}$ to give

$$\mathbf{u}_i^{\mathrm{T}} \mathbf{A} \mathbf{u}_i^{\star} = \lambda_i^{\star} \mathbf{u}_i^{\mathrm{T}} \mathbf{u}_i^{\star} \tag{A.32}$$

where we have used $\mathbf{A}^{\star} = \mathbf{A}$ because we are considering only real matrices A.
Taking the transpose of the second of these equations and using $\mathbf{A}^{\mathrm{T}} = \mathbf{A}$, we see
that the left-hand sides of the two equations are equal and hence that $\lambda_i^{\star} = \lambda_i$, and
so $\lambda_i$ must be real.

The eigenvectors $\mathbf{u}_i$ of a real symmetric matrix can be chosen to be orthonormal
(i.e., orthogonal and of unit length) so that

$$\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = I_{ij} \tag{A.33}$$

where $I_{ij}$ are the elements of the identity matrix I. To show this, we first left-multiply
(A.29) by $\mathbf{u}_j^{\mathrm{T}}$ to give

$$\mathbf{u}_j^{\mathrm{T}} \mathbf{A} \mathbf{u}_i = \lambda_i \mathbf{u}_j^{\mathrm{T}} \mathbf{u}_i \tag{A.34}$$

and hence, by exchanging the indices, we have

$$\mathbf{u}_i^{\mathrm{T}} \mathbf{A} \mathbf{u}_j = \lambda_j \mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j. \tag{A.35}$$

We now take the transpose of the second equation and make use of the symmetry
property $\mathbf{A}^{\mathrm{T}} = \mathbf{A}$, and then subtract the two equations to give

$$(\lambda_i - \lambda_j) \, \mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = 0. \tag{A.36}$$

Hence, for $\lambda_i \neq \lambda_j$, we have $\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = 0$ so that $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal. If the two
eigenvalues are equal, then any linear combination $\alpha \mathbf{u}_i + \beta \mathbf{u}_j$ is also an eigenvector
with the same eigenvalue, so we can select one linear combination arbitrarily, and
then choose the second to be orthogonal to the first (it can be shown that the
degenerate eigenvectors are never linearly dependent). Hence, the eigenvectors can be
chosen to be orthogonal, and by normalizing can be set to unit length. Because there
are M eigenvalues, the corresponding M orthogonal eigenvectors form a complete
set and so any M-dimensional vector can be expressed as a linear combination of
the eigenvectors.

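We can take the eigenvectors $\mathbf{u}_i$ to be the columns of an M × M matrix U,
which from orthonormality satisfies

$$\mathbf{U}^{\mathrm{T}} \mathbf{U} = \mathbf{I}. \tag{A.37}$$

Such a matrix is said to be orthogonal. Interestingly, the rows of this matrix are also
orthogonal, so that $\mathbf{U}\mathbf{U}^{\mathrm{T}} = \mathbf{I}$. To show this, note that (A.37) implies
$\mathbf{U}^{\mathrm{T}}\mathbf{U}\mathbf{U}^{-1} = \mathbf{U}^{-1} = \mathbf{U}^{\mathrm{T}}$ and so $\mathbf{U}\mathbf{U}^{-1} = \mathbf{U}\mathbf{U}^{\mathrm{T}} = \mathbf{I}$. Using (A.12), it also follows that
$|\mathbf{U}|^2 = 1$, and so $|\mathbf{U}| = \pm 1$.
The eigenvector equation (A.29) can be expressed in terms of U in the form

$$\mathbf{A}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda} \tag{A.38}$$

where $\boldsymbol{\Lambda}$ is an M × M diagonal matrix whose diagonal elements are given by the
eigenvalues $\lambda_i$.
If we consider a column vector x that is transformed by an orthogonal matrix U
to give a new vector

$$\tilde{\mathbf{x}} = \mathbf{U}\mathbf{x} \tag{A.39}$$

then the length of the vector is preserved because

$$\tilde{\mathbf{x}}^{\mathrm{T}} \tilde{\mathbf{x}} = \mathbf{x}^{\mathrm{T}} \mathbf{U}^{\mathrm{T}} \mathbf{U} \mathbf{x} = \mathbf{x}^{\mathrm{T}} \mathbf{x} \tag{A.40}$$

and similarly the angle between any two such vectors is preserved because

$$\tilde{\mathbf{x}}^{\mathrm{T}} \tilde{\mathbf{y}} = \mathbf{x}^{\mathrm{T}} \mathbf{U}^{\mathrm{T}} \mathbf{U} \mathbf{y} = \mathbf{x}^{\mathrm{T}} \mathbf{y}. \tag{A.41}$$

Thus, multiplication by U can be interpreted as a rigid rotation of the coordinate
system.
From (A.38), it follows that

$$\mathbf{U}^{\mathrm{T}} \mathbf{A} \mathbf{U} = \boldsymbol{\Lambda} \tag{A.42}$$

and because $\boldsymbol{\Lambda}$ is a diagonal matrix, we say that the matrix A is diagonalized by the
matrix U. If we left-multiply by U and right-multiply by $\mathbf{U}^{\mathrm{T}}$, we obtain

$$\mathbf{A} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\mathrm{T}}. \tag{A.43}$$

Taking the inverse of this equation and using (A.3) together with $\mathbf{U}^{-1} = \mathbf{U}^{\mathrm{T}}$, we
have

$$\mathbf{A}^{-1} = \mathbf{U} \boldsymbol{\Lambda}^{-1} \mathbf{U}^{\mathrm{T}}. \tag{A.44}$$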
These last two equations can also be written in the form

$$\mathbf{A} = \sum_{i=1}^{M} \lambda_i \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}} \tag{A.45}$$

$$\mathbf{A}^{-1} = \sum_{i=1}^{M} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}}. \tag{A.46}$$

If we take the determinant of (A.43) and use (A.12), we obtain

$$|\mathbf{A}| = \prod_{i=1}^{M} \lambda_i. \tag{A.47}$$

Similarly, taking the trace of (A.43), and using the cyclic property (A.8) of the trace
operator together with $\mathbf{U}^{\mathrm{T}}\mathbf{U} = \mathbf{I}$, we have

$$\mathrm{Tr}(\mathbf{A}) = \sum_{i=1}^{M} \lambda_i. \tag{A.48}$$

We leave it as an exercise for the reader to verify (A.22) by making use of the results
(A.33), (A.45), (A.46), and (A.47).
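
The spectral results (A.33) and (A.45) to (A.48) can be checked on a small example. The sketch below is illustrative only: it uses a closed-form eigendecomposition that is specific to 2 × 2 symmetric matrices with a non-zero off-diagonal element, and the matrix entries are hypothetical.

```python
import math


def symmetric_eig_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] in closed form, assuming b != 0."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        v = (b, lam - a)  # solves the first row of (A - lam I) v = 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


def outer(u):
    """Outer product u u^T of a 2-vector with itself."""
    return [[u[i] * u[j] for j in range(2)] for i in range(2)]


a, b, c = 4.0, 1.0, 3.0  # hypothetical symmetric matrix [[4, 1], [1, 3]]
(lam1, u1), (lam2, u2) = symmetric_eig_2x2(a, b, c)

# Orthogonality of the eigenvectors, as in (A.33).
overlap = u1[0] * u2[0] + u1[1] * u2[1]

# Spectral expansions (A.45) and (A.46).
A_rebuilt = [[lam1 * outer(u1)[i][j] + lam2 * outer(u2)[i][j] for j in range(2)]
             for i in range(2)]
A_inv_rebuilt = [[outer(u1)[i][j] / lam1 + outer(u2)[i][j] / lam2 for j in range(2)]
                 for i in range(2)]

# Direct inverse, determinant, and trace for comparison with (A.46) to (A.48).
det_A = a * c - b * b
A_inv_direct = [[c / det_A, -b / det_A], [-b / det_A, a / det_A]]

print(A_rebuilt, lam1 * lam2, lam1 + lam2)
```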

A matrix A is said to be positive definite, denoted by $\mathbf{A} \succ 0$, if $\mathbf{w}^{\mathrm{T}} \mathbf{A} \mathbf{w} > 0$ for
all non-zero values of the vector w. Equivalently, a positive definite matrix has
$\lambda_i > 0$ for all of its eigenvalues (as can be seen by setting w to each of the eigenvectors in
turn and noting that an arbitrary vector can be expanded as a linear combination of
the eigenvectors). Note that having all positive elements does not necessarily mean
that a matrix is positive definite. For example, the matrix

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \tag{A.49}$$

has eigenvalues $\lambda_1 \simeq 5.37$ and $\lambda_2 \simeq -0.37$. A matrix is said to be positive
semidefinite if $\mathbf{w}^{\mathrm{T}} \mathbf{A} \mathbf{w} \geqslant 0$ holds for all values of w, which is denoted $\mathbf{A} \succeq 0$ and is
equivalent to $\lambda_i \geqslant 0$.
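
The eigenvalues quoted for the matrix (A.49) follow from the characteristic equation (A.30), which for a 2 × 2 matrix is a quadratic in the eigenvalue. The sketch below reproduces them and also exhibits one vector for which the quadratic form is negative; the particular vector w is a hypothetical choice made for illustration.

```python
import math

A = [[1.0, 2.0], [3.0, 4.0]]

# For a 2 x 2 matrix, (A.30) reads lam^2 - Tr(A) lam + |A| = 0.
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
root = math.sqrt(trace ** 2 - 4.0 * det)
lam1 = 0.5 * (trace + root)
lam2 = 0.5 * (trace - root)


def quadratic_form(M, w):
    """Evaluate w^T M w for a 2 x 2 matrix M and a 2-vector w."""
    return sum(w[i] * M[i][j] * w[j] for i in range(2) for j in range(2))


w = (2.0, -1.0)  # hypothetical direction chosen to make the form negative
q = quadratic_form(A, w)

print(lam1, lam2, q)
```

Although every element of the matrix is positive, the quadratic form is negative for this w, so the matrix is not positive definite.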

The condition number of a matrix is given by

$$\mathrm{CN} = \left( \frac{\lambda_{\max}}{\lambda_{\min}} \right)^{1/2} \tag{A.50}$$

where $\lambda_{\max}$ is the largest eigenvalue and $\lambda_{\min}$ is the smallest eigenvalue.


Appendix B. Calculus of Variations 


We can think of a function y(x) as being an operator that, for any input value x,
returns an output value y. In the same way, we can define a functional F[y] to be
an operator that takes a function y(x) and returns an output value F. An example
of a functional is the length of a curve drawn in a two-dimensional plane in which
the path of the curve is defined in terms of a function. In the context of machine
learning, a widely used functional is the entropy H[x] for a continuous variable x
because, for any choice of probability density function p(x), it returns a scalar value
representing the entropy of x under that density. Thus, the entropy of p(x) could
equally well have been written as H[p].

A common problem in conventional calculus is to find a value of x that maximizes
(or minimizes) a function y(x). Similarly, in the calculus of variations we
seek a function y(x) that maximizes (or minimizes) a functional F[y]. That is, of all
possible functions y(x), we wish to find the particular function for which the
functional F[y] is a maximum (or minimum). The calculus of variations can be used, for
instance, to show that the shortest path between two points is a straight line or that
the maximum entropy distribution is a Gaussian.

If we were not familiar with the rules of ordinary calculus, we could evaluate
a conventional derivative dy/dx by making a small change ε to the variable x and
then expanding in powers of ε, so that

$$y(x + \epsilon) = y(x) + \frac{\mathrm{d}y}{\mathrm{d}x} \epsilon + O(\epsilon^2) \tag{B.1}$$

and finally taking the limit ε → 0. Similarly, for a function of several variables
$y(x_1, \ldots, x_D)$, the corresponding partial derivatives are defined by

$$y(x_1 + \epsilon_1, \ldots, x_D + \epsilon_D) = y(x_1, \ldots, x_D) + \sum_{i=1}^{D} \frac{\partial y}{\partial x_i} \epsilon_i + O(\epsilon^2). \tag{B.2}$$


Figure B.1 A functional derivative can be defined by considering how the value of a
functional F[y] changes when the function y(x) is changed to y(x) + εη(x) where η(x) is an
arbitrary function of x.

The analogous definition of a functional derivative arises when we consider how
much a functional F[y] changes when we make a small change εη(x) to the function
y(x), where η(x) is an arbitrary function of x, as illustrated in Figure B.1. We denote
the functional derivative of F[y] with respect to y(x) by δF/δy(x) and define it by
the following relation:

$$F[y(x) + \epsilon \eta(x)] = F[y(x)] + \epsilon \int \frac{\delta F}{\delta y(x)} \eta(x) \, \mathrm{d}x + O(\epsilon^2). \tag{B.3}$$

This can be seen as a natural extension of (B.2) in which F[y] now depends on a
continuous set of variables, namely the values of y at all points x. Requiring that the
functional be stationary with respect to small variations in the function y(x) gives

$$\int \frac{\delta F}{\delta y(x)} \eta(x) \, \mathrm{d}x = 0. \tag{B.4}$$

Because this must hold for an arbitrary choice of η(x), it follows that the functional
derivative must vanish. To see this, imagine choosing a perturbation η(x) that is zero
everywhere except in the neighbourhood of a point $\hat{x}$, in which case the functional
derivative must be zero at $x = \hat{x}$. However, because this must be true for every
choice of $\hat{x}$, the functional derivative must vanish for all values of x.

Consider a functional that is defined by an integral over a function G(y, y′, x),
which depends on both y(x) and its derivative y′(x) and has a direct dependence on
x:

$$F[y] = \int G\left( y(x), y'(x), x \right) \mathrm{d}x \tag{B.5}$$


where the value of y(x) is assumed to be fixed at the boundary of the region of
integration (which might be at infinity). If we now consider variations in the function
y(x), we obtain

$$F[y(x) + \epsilon \eta(x)] = F[y(x)] + \epsilon \int \left\{ \frac{\partial G}{\partial y} \eta(x) + \frac{\partial G}{\partial y'} \eta'(x) \right\} \mathrm{d}x + O(\epsilon^2). \tag{B.6}$$

We now have to cast this in the form (B.3). To do so, we integrate the second term
by parts and note that η(x) must vanish at the boundary of the integral (because y(x)
is fixed at the boundary). This gives

$$F[y(x) + \epsilon \eta(x)] = F[y(x)] + \epsilon \int \left\{ \frac{\partial G}{\partial y} - \frac{\mathrm{d}}{\mathrm{d}x} \left( \frac{\partial G}{\partial y'} \right) \right\} \eta(x) \, \mathrm{d}x + O(\epsilon^2) \tag{B.7}$$

from which we can read off the functional derivative by comparison with (B.3).
Requiring that the functional derivative vanishes then gives

$$\frac{\partial G}{\partial y} - \frac{\mathrm{d}}{\mathrm{d}x} \left( \frac{\partial G}{\partial y'} \right) = 0, \tag{B.8}$$

which are known as the Euler-Lagrange equations. For example, if

$$G = y(x)^2 + \left( y'(x) \right)^2 \tag{B.9}$$

then the Euler-Lagrange equations take the form

$$y(x) - \frac{\mathrm{d}^2 y}{\mathrm{d}x^2} = 0. \tag{B.10}$$

This second-order differential equation can be solved for y(x) by making use of the
boundary conditions on y(x).
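
The example (B.9) and (B.10) can be explored numerically. The sketch below is an illustration under assumptions that are not in the text: the region of integration is taken to be [0, 1] with boundary values y(0) = 0 and y(1) = 1, for which (B.10) has the solution y(x) = sinh(x)/sinh(1), and the perturbation η(x) = sin(πx) is a hypothetical choice that vanishes at both boundaries.

```python
import math


def functional(y, dy, n=2000):
    """F[y] = integral over [0, 1] of y(x)^2 + y'(x)^2, by the trapezoidal rule."""
    h = 1.0 / n
    vals = [y(k * h) ** 2 + dy(k * h) ** 2 for k in range(n + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


def y_star(x):
    """Solution of (B.10) with the assumed boundary values y(0) = 0, y(1) = 1."""
    return math.sinh(x) / math.sinh(1.0)


def dy_star(x):
    return math.cosh(x) / math.sinh(1.0)


def perturbed_value(eps):
    """F[y_star + eps * eta] with eta(x) = sin(pi x), which is zero at x = 0 and x = 1."""
    def y(x):
        return y_star(x) + eps * math.sin(math.pi * x)

    def dy(x):
        return dy_star(x) + eps * math.pi * math.cos(math.pi * x)

    return functional(y, dy)


f_star = functional(y_star, dy_star)
increases = [perturbed_value(eps) - f_star for eps in (-0.1, 0.05, 0.1)]
print(f_star, increases)
```

The change in F is positive for perturbations of either sign and scales with the square of ε, consistent with the first-order term in (B.7) vanishing at the stationary function.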

Often, we consider functionals defined by integrals whose integrands take the
form G(y, x) and that do not depend on the derivatives of y(x). In this case,
stationarity simply requires that ∂G/∂y(x) = 0 for all values of x.

If we are optimizing a functional with respect to a probability distribution, then 
we need to maintain the normalization constraint on the probabilities. This is often 
most conveniently done using a Lagrange multiplier, which then allows an uncon- 
strained optimization to be performed. 

The extension of the above results to a multi-dimensional variable x is straight- 
forward. For a more comprehensive discussion of the calculus of variations, see 
Sagan (1969). 


Appendix C. Lagrange Multipliers 


Lagrange multipliers, also sometimes called undetermined multipliers, are used to 
find the stationary points of a function of several variables subject to one or more 
constraints. 

Consider the problem of finding the maximum of a function $f(x_1, x_2)$ subject to
a constraint relating $x_1$ and $x_2$, which we write in the form

$$g(x_1, x_2) = 0. \tag{C.1}$$

One approach would be to solve the constraint equation (C.1) and thus express $x_2$ as
a function of $x_1$ in the form $x_2 = h(x_1)$. This can then be substituted into $f(x_1, x_2)$
to give a function of $x_1$ alone of the form $f(x_1, h(x_1))$. The maximum with respect
to $x_1$ could then be found by differentiation in the usual way, to give the stationary
value $x_1^{\star}$, with the corresponding value of $x_2$ given by $x_2^{\star} = h(x_1^{\star})$.

One problem with this approach is that it may be difficult to find an analytic
solution of the constraint equation that allows $x_2$ to be expressed as an explicit
function of $x_1$. Also, this approach treats $x_1$ and $x_2$ differently and so spoils the natural
symmetry between these variables.

A more elegant, and often simpler, approach introduces a parameter λ called a
Lagrange multiplier. We shall motivate this technique from a geometrical
perspective. Consider a D-dimensional variable x with components $x_1, \ldots, x_D$. The
constraint equation g(x) = 0 then represents a (D − 1)-dimensional surface in x-space
as indicated in Figure C.1.

First note that at any point on the constraint surface, the gradient ∇g(x) of the
constraint function is orthogonal to the surface. To see this, consider a point x that
lies on the constraint surface along with a nearby point x + ε that also lies on the
surface. If we make a Taylor expansion around x, we have

$$g(\mathbf{x} + \boldsymbol{\epsilon}) \simeq g(\mathbf{x}) + \boldsymbol{\epsilon}^{\mathrm{T}} \nabla g(\mathbf{x}). \tag{C.2}$$

Because both x and x + ε lie on the constraint surface, we have g(x) = g(x + ε) and
hence $\boldsymbol{\epsilon}^{\mathrm{T}} \nabla g(\mathbf{x}) \simeq 0$. In the limit ‖ε‖ → 0, we have $\boldsymbol{\epsilon}^{\mathrm{T}} \nabla g(\mathbf{x}) = 0$, and because ε is
then parallel to the constraint surface g(x) = 0, we see that the vector ∇g is normal
to the surface.

Figure C.1 A geometrical picture of the technique of Lagrange multipliers in which we
seek to maximize a function f(x), subject to the constraint g(x) = 0. If x is D dimensional,
the constraint g(x) = 0 corresponds to a subspace of dimensionality D − 1, as indicated
by the red curve. The problem can be solved by optimizing the Lagrangian function
L(x, λ) = f(x) + λg(x).

Next we seek a point $\mathbf{x}^{\star}$ on the constraint surface such that f(x) is maximized.
Such a point must have the property that the vector ∇f(x) is also orthogonal to the
constraint surface, as illustrated in Figure C.1, because otherwise we could increase
the value of f(x) by moving a short distance along the constraint surface. Thus, ∇f
and ∇g are parallel (or anti-parallel) vectors, and so there must exist a parameter λ
such that

$$\nabla f + \lambda \nabla g = 0 \tag{C.3}$$

where λ ≠ 0 is known as a Lagrange multiplier. Note that λ can have either sign.
At this point, it is convenient to introduce the Lagrangian function defined by

$$L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda g(\mathbf{x}). \tag{C.4}$$

The constrained stationarity condition (C.3) is obtained by setting $\nabla_{\mathbf{x}} L = 0$.
Furthermore, the condition ∂L/∂λ = 0 leads to the constraint equation g(x) = 0.
Thus, to find the maximum of a function f(x) subject to the constraint g(x) = 0,
we define the Lagrangian function given by (C.4) and we then find the stationary
point of L(x, λ) with respect to both x and λ. For a D-dimensional vector x, this
gives D + 1 equations that determine both the stationary point $\mathbf{x}^{\star}$ and the value of λ.
If we are interested only in $\mathbf{x}^{\star}$, then we can eliminate λ from the stationarity
equations without needing to find its value (hence, the term ‘undetermined multiplier’).
As a simple example, suppose we wish to find the stationary point of the function
$f(x_1, x_2) = 1 - x_1^2 - x_2^2$ subject to the constraint $g(x_1, x_2) = x_1 + x_2 - 1 = 0$, as
illustrated in Figure C.2. The corresponding Lagrangian function is given by

$$L(\mathbf{x}, \lambda) = 1 - x_1^2 - x_2^2 + \lambda (x_1 + x_2 - 1). \tag{C.5}$$

The conditions for this Lagrangian to be stationary with respect to $x_1$, $x_2$, and λ give
the following coupled equations:

$$-2x_1 + \lambda = 0 \tag{C.6}$$

$$-2x_2 + \lambda = 0 \tag{C.7}$$

$$x_1 + x_2 - 1 = 0. \tag{C.8}$$

Solving these equations then gives the stationary point as $(x_1^{\star}, x_2^{\star}) = (1/2, 1/2)$, and
the corresponding value for the Lagrange multiplier is λ = 1.

Figure C.2 A simple example of the use of Lagrange multipliers in which the aim is to
maximize $f(x_1, x_2) = 1 - x_1^2 - x_2^2$ subject to the constraint $g(x_1, x_2) = 0$ where
$g(x_1, x_2) = x_1 + x_2 - 1$. The circles show contours of the function $f(x_1, x_2)$, and the
diagonal line shows the constraint surface $g(x_1, x_2) = 0$.

So far, we have considered the problem of maximizing a function subject to an
equality constraint of the form g(x) = 0. We now consider the problem of
maximizing f(x) subject to an inequality constraint of the form g(x) ≥ 0, as illustrated
in Figure C.3.

There are now two kinds of solution possible, according to whether the
constrained stationary point lies in the region where g(x) > 0, in which case the
constraint is inactive, or whether it lies on the boundary g(x) = 0, in which case the
constraint is said to be active. In the former case, the function g(x) plays no role
and so the stationary condition is simply ∇f(x) = 0. This again corresponds to
a stationary point of the Lagrange function (C.4) but this time with λ = 0. The
latter case, where the solution lies on the boundary, is analogous to the equality
constraint discussed previously and corresponds to a stationary point of the Lagrange
function (C.4) with λ ≠ 0. Now, however, the sign of the Lagrange multiplier is
crucial, because the function f(x) is at a maximum only if its gradient is oriented
away from the region g(x) > 0, as illustrated in Figure C.3. We therefore have
∇f(x) = −λ∇g(x) for some value of λ > 0.

For either of these two cases, the product λg(x) = 0. Thus, the solution to
the problem of maximizing f(x) subject to g(x) ≥ 0 is obtained by optimizing the
Lagrange function (C.4) with respect to x and λ subject to the conditions

$$g(\mathbf{x}) \geqslant 0 \tag{C.9}$$

$$\lambda \geqslant 0 \tag{C.10}$$

$$\lambda g(\mathbf{x}) = 0. \tag{C.11}$$

These are known as the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939;
Kuhn and Tucker, 1951).

Figure C.3 Illustration of the problem of maximizing f(x) subject to the inequality
constraint g(x) ≥ 0.

Note that if we wish to minimize (rather than maximize) the function f(x)
subject to an inequality constraint g(x) ≥ 0, then we minimize the Lagrangian function
L(x, λ) = f(x) − λg(x) with respect to x, again subject to λ ≥ 0.
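
Both the equality-constrained example and the KKT conditions can be checked on the function $f(x_1, x_2) = 1 - x_1^2 - x_2^2$. The sketch below is illustrative only: the helper `kkt_satisfied`, the tolerance, the grid search, and the flipped constraint $1 - x_1 - x_2 \geqslant 0$ are hypothetical additions introduced to show an inactive constraint.

```python
def f(x1, x2):
    return 1.0 - x1 ** 2 - x2 ** 2


def grad_f(x1, x2):
    return (-2.0 * x1, -2.0 * x2)


def g(x1, x2):
    return x1 + x2 - 1.0


GRAD_G = (1.0, 1.0)

# Equality constraint g(x) = 0: (C.6) and (C.7) give x1 = x2 = lam / 2, and
# substituting into (C.8) gives lam = 1.
lam = 1.0
x_star = (lam / 2.0, lam / 2.0)

# Cross-check by the substitution approach x2 = 1 - x1 on a coarse grid.
grid = [k / 1000.0 for k in range(-1000, 2001)]
x1_best = max(grid, key=lambda t: f(t, 1.0 - t))


def kkt_satisfied(x, lam, g_fn, grad_g, tol=1e-12):
    """Check stationarity of L = f + lam * g together with (C.9) to (C.11)."""
    stationary = all(abs(df + lam * dg) <= tol
                     for df, dg in zip(grad_f(*x), grad_g))
    feasible = g_fn(*x) >= -tol
    return stationary and feasible and lam >= 0.0 and abs(lam * g_fn(*x)) <= tol


# Active case: maximize f subject to x1 + x2 - 1 >= 0.
active_ok = kkt_satisfied((0.5, 0.5), 1.0, g, GRAD_G)


def g_flipped(x1, x2):
    """Hypothetical flipped constraint 1 - x1 - x2 >= 0."""
    return 1.0 - x1 - x2


# Inactive case: the unconstrained maximum at the origin is feasible, so lam = 0.
inactive_ok = kkt_satisfied((0.0, 0.0), 0.0, g_flipped, (-1.0, -1.0))

# The boundary point would need lam = -1 under the flipped constraint, so it is rejected.
boundary_rejected = not kkt_satisfied((0.5, 0.5), -1.0, g_flipped, (-1.0, -1.0))

print(x_star, x1_best, active_ok, inactive_ok, boundary_rejected)
```

The last check illustrates why the sign of the multiplier matters for inequality constraints: the same boundary point is stationary for the Lagrangian in both cases, but only one orientation of the constraint yields λ ≥ 0.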

Finally, it is straightforward to extend the technique of Lagrange multipliers to
cases with multiple equality and inequality constraints. Suppose we wish to
maximize f(x) subject to $g_j(\mathbf{x}) = 0$ for j = 1, ..., J, and $h_k(\mathbf{x}) \geqslant 0$ for k = 1, ..., K.
We then introduce Lagrange multipliers $\{\lambda_j\}$ and $\{\mu_k\}$, and then optimize the
Lagrangian function given by

$$L(\mathbf{x}, \{\lambda_j\}, \{\mu_k\}) = f(\mathbf{x}) + \sum_{j=1}^{J} \lambda_j g_j(\mathbf{x}) + \sum_{k=1}^{K} \mu_k h_k(\mathbf{x}) \tag{C.12}$$

subject to $\mu_k \geqslant 0$ and $\mu_k h_k(\mathbf{x}) = 0$ for k = 1, ..., K. Extensions to constrained
functional derivatives are similarly straightforward. For a more detailed discussion
of the technique of Lagrange multipliers, see Nocedal and Wright (1999).
AdaGrad, 223

Adam optimization, 224 

adaptive rejection sampling, 435 
adjoint sensitivity method, 556
adversarial attack, 306 

aggregation, 415 

aleatoric uncertainty, 23 
alpha family, 61

amortized inference, 572 

ancestral sampling, 450 

anchor, 191 

annealed Langevin dynamics, 598 
AR model, see autoregressive model 
area under the ROC curve, 149 
artificial intelligence, 1 

attention, 358 

attention head, 366 

audio data, 399 

auto-associative neural network, see autoencoder 


autoencoder, 188, 563 

automatic differentiation, 22, 233, 244 
autoregressive flow, 552 
autoregressive model, 5, 350, 379 
average pooling, 297 


backpropagation, 19, 233 
backpropagation through time, 381 
bag of words, 378 

bagging, 278 

base distribution, 547 

basis function, 112, 158, 172, 172 
batch gradient descent, 214 
batch learning, 117 

batch normalization, 227 
Bayes net, 326 

Bayes’ theorem, 28 

Bayesian network, 326 
Bayesian probability, 54 
beam search, 386 

Bernoulli distribution, 66, 94 
Bernoulli mixture model, 481 
BERT, 388 

bi-gram model, 379 
bias parameter, 112, 132, 180
bias–variance trade-off, 123
BigGAN, 539
bijective function, 548
binomial distribution, 67
bits, 46
blind source separation, 514
blocked path, 339, 343
boosting, 279
bootstrap, 278
bottleneck, 382
bounding box, 309
Box–Muller method, 432
byte pair encoding, 377


canonical correlation analysis, 501
canonical link function, 164
Cauchy distribution, 432
causal attention, 384
causality, 347
central differences, 239
central limit theorem, 71
ChatGPT, 394
child node, 249, 327
Cholesky decomposition, 433
circular normal distribution, 89
classical probability, 54
classification, 3
CLIP, 192
co-parents, 348
codebook vector, 398, 465
collider node, 341
combining models, 146
committee, 277
complete data set, 476
completing the square, 77
computer vision, 288
concave function, 52
concentration parameter, 92
condition number, 220
conditional entropy, 53
conditional expectation, 35
conditional independence, 146, 337
conditional mixture model, 199
conditional probability, 27
conditional VAE, 576
conditioner, 552
confusion matrix, 147
continuous bag of words, 375
continuous normalizing flow, 557
contrastive divergence, 455
contrastive learning, 191
convex function, 51
convolution, 290, 322
convolutional network, 287
correlation matrix, 503
cost function, 140
coupling flow, 549
coupling function, 552
covariance, 35
Cox’s axioms, 54
cross attention, 390
cross-correlation, 292, 322
cross-entropy error function, 160, 162, 196
cross-validation, 14
cumulative distribution function, 32
curse of dimensionality, 172
curve fitting, 6
CycleGAN, 539


d-separation, 338, 343, 479
DAG, see directed acyclic graph
data augmentation, 192, 257
data compression, 465
DDIM, 594
DDPM, 581
decision, 120
decision boundary, 131, 139
decision region, 131, 139
decision surface, see decision boundary
decision theory, 120, 138
decoder, 563
deep double descent, 268
deep learning, 20
deep neural networks, 20
deep sets, 417
DeepDream, 308
degrees of freedom, 495
denoising, 581
denoising autoencoder, 567
denoising diffusion implicit model, 594
denoising diffusion probabilistic model, 581
denoising score matching, 597

density estimation, 37, 65 
dequantization, 526 
descendant node, 341 

design matrix, 116 
development set, 14 

diagonal covariance matrix, 75 
differential entropy, 50 
diffusion kernel, 583 

diffusion model, 581 

Dirac delta function, 34 
directed acyclic graph, 329 
directed cycle, 329 

directed factorization, 349 
directed graph, 326 

directed graphical model, 326 
discriminant function, 132, 143 
discriminative model, 144, 157, 346 
disentangled representations, 542 
distributed representation, 187 
dot-product attention, 363 
double descent, 268 

dropout, 279 


E step, 472, 476 

early stopping, 266 

earth mover’s distance, 538 

ECM, see expectation conditional maximization 
edge, 326, 410 

edge detection, 292 

ELBO, see evidence lower bound 
EM, see expectation maximization 
embedding space, 188 

embedding vector, 409 

encoder, 563 

energy function, 452 

energy-based models, 452 

ensemble methods, 277 

entropy, 46 

epistemic uncertainty, 23 

epoch, 215 

equality constraint, 623 

equivariance, 259, 292, 296, 371, 412 
erf function, 164 

error backpropagation, see backpropagation 
error function, 8, 55, 194, 210 
Euler-Lagrange equations, 619

evaluation trace, 247 

evidence lower bound, 485, 516, 570, 588 
expectation, 34 

expectation conditional maximization, 489 
expectation maximization, 470, 474, 517, 519 
expectation step, see E step 

expectations, 430 

explaining away, 343 

exploding gradient, 227, 382 

exponential distribution, 34, 431 
exponential family, 94, 156, 329 
expression swell, 245 


factor analysis, 513 

factor graph, 327 

factor loading, 513 

false negative, 25 

false positive, 25 

fast gradient sign method, 306 
fast R-CNN, 314 

feature extraction, 20, 113 
feature map, 291 

features, 179 

feed-forward network, 172, 193 
feed-forward networks, 19 
few-shot learning, 191, 394 
filter, 291 

fine-tuning, 3, 22, 189, 392 

flow matching, 558 

forward kinematics, 199 
forward problem, 198 

forward propagation, 235 
foundation model, 22, 358, 392, 409 
frequentist probability, 54 

fuel system, 341 

fully connected graphical model, 328 
fully convolutional network, 318 
functional, 617 


Gabor filters, 302 

gamma distribution, 434 

GAN, see generative adversarial network 
gated recurrent unit, 382 

Gaussian, 36, 70 

Gaussian mixture, 86, 200, 271, 466 
GEM, see generalized EM algorithm 
generalization, 6 

generalized EM algorithm, 489 
generalized linear model, 158, 165 
generative adversarial network, 533 
generative AI, 4 

generative model, 4, 144, 346, 533 
generative pre-trained transformer, 6, 383 
geometric deep learning, 424 

Gibbs sampling, 446 

global minimum, 211 

GNN, see graph neural network 
GPT, see generative pre-trained transformer 
GPU, see graphics processing unit 
gradient descent, 209 

graph attention network, 421 

graph convolutional network, 414 
graph neural network, 407 

graph representation learning, 409 
graphical model, 326 

graphical model factorization, 329 
graphics processing unit, 20, 358 
group theory, 256 

guidance, 600 


Hadamard product, 550 
Hamiltonian Monte Carlo, 451 
handwritten digit, 501 

He initialization, 216 
head-to-head path, 341 
head-to-tail path, 340 

Heaviside step function, 161 
Hessian matrix, 211, 242 

Hessian outer product approximation, 243 
heteroscedastic, 200 

hidden Markov model, 380, 480 
hidden unit, 19, 180 

hidden variable, see latent variable 
hierarchical representation, 187 
histogram density estimation, 98 
history of machine learning, 16 
hold-out set, 14 

homogeneous Markov chain, 443 
Hooke’s law, 520 

Hutchinson’s trace estimator, 557 


hybrid Monte Carlo, 451 
hyperparameter, 14 


IAF, see inverse autoregressive flow 

ICA, see independent component analysis 
identifiability, 470 

IID, see independent and identically distributed 
image segmentation, 315 

ImageNet data set, 299 

importance sampling, 437, 450 
importance weight, 437 

improper distribution, 33 

improper prior, 263 

inactive constraint, 623 

incomplete data set, 476 

independent and identically distributed, 37, 344 
independent component analysis, 514 
independent factor analysis, 515 
independent variables, 31 

inductive bias, 19, 254 

inductive learning, 409, 420 

inequality constraint, 623 

inference, 120, 138, 143, 336 

InfoNCE, 191 

information theory, 46 

instance discrimination, 192 

internal covariate shift, 229 

internal representation, 308 
intersection-over-union, 310 

intrinsic dimensionality, 496 

invariance, 256, 297, 412 

inverse autoregressive flow, 553 

inverse kinematics, 199 

inverse problem, 123, 198, 254, 346 

Iris data, 173 

IRLS, see iterative reweighted least squares 
isotropic covariance matrix, 75 

iterative reweighted least squares, 160 


Jacobian matrix, 44, 240 
Jensen’s inequality, 52 
Jensen-Shannon divergence, 544 


K nearest neighbours, 103 
K-means clustering algorithm, 460, 480 
Kalman filter, 353, 515 


Karush-Kuhn-Tucker conditions, 624 
kernel density estimator, 100, 596 

kernel function, 101 

kernel image, 291 

KKT, see Karush-Kuhn-Tucker conditions 


KL divergence, see Kullback-Leibler divergence 


Kosambi-Karhunen-Loève transform, 497 
Kullback-Leibler divergence, 51, 486 


Lagrange multiplier, 621 
Lagrangian, 622 

Langevin dynamics, 454 
Langevin sampling, 455 

language model, 382 

Laplace distribution, 34 

large language model, 5, 382, 390 
lasso, 264 

latent class analysis, 481 

latent diffusion model, 601 

latent variable, 76, 335, 459, 495 
layer normalization, 229, 369 
LDM, see latent diffusion model 
LDS, see linear dynamical system 
leaky ReLU, 185 

learning curve, 223, 266 

learning rate parameter, 214 
learning to learn, 190 
least-mean-squares algorithm, 118 
least-squares GAN, 537 
leave-one-out, 15 

LeNet convolutional network, 299 
Levenberg-Marquardt approximation, 244 
likelihood function, 38, 468 
likelihood weighted sampling, 451 
linear discriminant, 132 

linear dynamical system, 515 
linear independence, 610 

linear regression, 6, 112 
linear-Gaussian model, 79, 332 
linearly separable, 132 

link, see edge 

link function, 158, 165 

LLM, see large language model 
LMS, see least-mean-squares algorithm 
local minimum, 211 
log odds, 151 

logic sampling, 450 

logistic regression, 159 

logistic sigmoid, 95, 113, 151, 159 
logit function, 151 

long short-term memory, 382 
LoRA, see low-rank adaptation 
loss function, 120, 140 

loss matrix, 142 

lossless data compression, 465 
lossy data compression, 465 
low-rank adaptation, 392 

LSGAN, see least-squares GAN 
LSTM, see long short-term memory 


M step, 472, 477 

macrostate, 48 

MAE, see masked autoencoder 

MAF, see masked autoregressive flow 
Mahalanobis distance, 71 

manifold, 177, 522 

MAP, see maximum a posteriori 
marginal probability, 27 

Markov blanket, 347, 449 

Markov boundary, see Markov blanket 
Markov chain, 351, 442 

Markov chain Monte Carlo, 440 
Markov model, 351 

Markov random field, 327 

masked attention, 384 

masked autoencoder, 567 

masked autoregressive flow, 553 
max-pooling, 297 

max-unpooling, 317 

maximization step, see M step 
maximum a posteriori, 56, 477 
maximum likelihood, 38, 84, 115, 153 
MCMC, see Markov chain Monte Carlo 
MDN, see mixture density network 
mean, 36 

mean value theorem, 49 

measure theory, 33 

mel spectrogram, 399 
message-passing, 414 
message-passing neural network, 415 
meta-learning, 190 

Metropolis algorithm, 441 
Metropolis-Hastings algorithm, 445 
microstate, 48 

mini-batches, 216 

minimum risk, 145 

Minkowski loss, 122 

missing at random, 477, 519 
missing data, 519 

mixing coefficient, 87 

mixture component, 87 
mixture density network, 198 
mixture distribution, 459 
mixture model, 459 

mixture of Gaussians, 86, 200, 271, 466 
MLP, see multilayer perceptron 
MNIST data, 495 

mode collapse, 536 

model averaging, 277 

model comparison, 9 

model selection, 14 

moment, 37 

momentum, 220 

Monte Carlo dropout, 280 
Monte Carlo sampling, 429 


Moore-Penrose pseudo-inverse, see pseudo-inverse 


MRF, see Markov random field 
multi-class logistic regression, 161 
multi-head attention, 366 
multilayer perceptron, 18, 172 
multimodal transformer, 394 
multimodality, 199 

multinomial distribution, 70, 95 
multiplicity, 48 

multitask learning, 190 

mutual information, 54 


n-gram model, 379 

naive Bayes model, 147, 344, 378 
nats, 47 

natural language processing, 374 
natural parameter, 94 
nearest-neighbours, 103 
neocognitron, 302 

Nesterov momentum, 221 


neural ordinary differential equation, 554 
neuroscience, 302 

NLP, see natural language processing 

no free lunch theorem, 255 

node, 326, 410 

noise, 23 

noiseless coding theorem, 47 

noisy-OR, 354 

non-identifiability, 513 

non-max suppression, 314 
nonparametric methods, 66, 98 

normal distribution, see Gaussian 

normal equations, 116 

normalized exponential, see softmax function 
novelty detection, 144 


object detection, 308 

observed variable, 335 

Old Faithful data, 86 

one-hot encoding, see 1-of-K encoding 
one-shot learning, 191 
one-versus-one classifier, 134 
one-versus-the-rest classifier, 134 
online gradient descent, 215 
online learning, 117 

ordered over-relaxation, 449 
outer product approximation, 244 
outlier, 137, 144, 164 
over-fitting, 10, 123, 470 
over-relaxation, 449 
over-smoothing, 422 


padding, 294 

parameter sharing, 270, 331 

parameter shrinkage, 118 

parameter tying, see parameter sharing 
parent node, 247, 327 

partition function, 452 

Parzen estimator, see kernel density estimator 
Parzen window, 101 

PCA, see principal component analysis 
perceptron, 17 

periodic variables, 89 

permutation matrix, 411 

PixelCNN, 397 

PixelRNN, 397 


plate, 334 

polynomial curve fitting, 6 
pooling, 296 

positional encoding, 371 
positive definite covariance, 72 
positive definite matrix, 615 
posterior collapse, 577 
posterior probability, 31 
power method, 498 
pre-activation, 17 
pre-processing, 20 
pre-training, 189, 392 
precision matrix, 77 

precision parameter, 36 
predictive distribution, 42, 120 
prefix prompt, 394 

principal component analysis, 497, 506, 565 
principal subspace, 497 

prior, 263 

prior knowledge, 19, 255 
prior probability, 31, 145 


probabilistic graphical model, see graphical model 


probabilistic PCA, 506 

probability, 25 

probability density, 32 

probability theory, 23 

probit function, 164 

probit regression, 163 

product rule of probability, 26, 28, 326 
prompt, 394, 601 

prompt engineering, 394 

proposal distribution, 433, 437, 441 
pseudo-inverse, 116, 136 
pseudo-random numbers, 430 


quadratic discriminant, 153 


radial basis functions, 179 

random variable, 26 

raster scan, 397 

readout layer, 419 

real NVP normalizing flow, 549 

receiver operating characteristic, see ROC curve 
receptive field, 290, 416 

recurrent neural network, 380 

regression, 3 
regression function, 121 
regularization, 12, 253 
regularized least squares, 118 
reject option, 142, 145 
rejection sampling, 433 

relative entropy, 51 
reparameterization trick, 574 
representation learning, 22, 188 
residual block, 275 

residual connection, 22, 274 
residual network, 275 

resnet, see residual network 
responsibility, 88, 468 

RLHF, 394 

RMS error, see root-mean-square error 
RMSProp, 223 

RNN, see recurrent neural network 
robot arm, 198 

robustness, 137 

ROC curve, 148 
root-mean-square error, 10 


saliency map, 305 

same convolution, 294 

sample mean, 39 

sample variance, 39 

sampling, 429 
sampling-importance-resampling, 439 
scale invariance, 256 

scaled self-attention, 366 

scaling hypothesis, 358 

Schur complement, 79 

score function, 455, 594 

score matching, 594 

self-attention, 362 

self-supervised learning, 5, 375 
semi-supervised learning, 420 
sequential estimation, 85 

sequential gradient descent, 118 
sequential learning, 117 

SGD, see stochastic gradient descent 
shared parameters, see parameter sharing 
shared weights, 292 

shattered gradients, 274 

shrinkage, 13 
sigmoid, see logistic sigmoid 

singular value decomposition, 117 

SIR, see sampling-importance-resampling 
skip-grams, 375 

skip-layer connections, 274 

sliding window, 311 

smoothing parameter, 100 

soft ReLU, 185 

soft weight sharing, 271 

softmax function, 96, 152, 197, 201, 363 
softplus activation function, 185 

sparse autoencoders, 566 

sparse connections, 292 

sparsity, 264 

sphering, 504 

standard deviation, 36 

standardizing, 462, 503 

state-space model, 352 

statistical bias, see bias 

statistical independence, see independent variables 
steepest descent, 214 

Stein score, see score function 
Stirling’s approximation, 48 

stochastic, 8 

stochastic differential equation, 598 
stochastic gradient descent, 19, 214, 215 
stochastic variable, 26 

strided convolution, 294 

strides, 311 

structured data, 287, 407 

style transfer, 320 

sufficient statistics, 67, 69, 84, 97 

sum rule of probability, 26, 28, 326 
sum-of-squares error, 8, 41, 136 
supervised learning, 3, 420 

support vector machine, 179 

SVD, see singular value decomposition 
SVM, see support vector machine 
swish activation function, 205 
symmetry, 256 

symmetry breaking, 216 


tail-to-tail path, 339 
tangent propagation, 258 
temperature, 387 


tensor, 194, 295 

test set, 10, 14 

text-to-speech, 400 

tied parameters, see parameter sharing 
token, 360 

tokenization, 377 

training set, 3 

transductive, 409, 419 
transductive learning, 420 
transfer learning, 3, 189, 218, 388 
transformers, 357 

transition probability, 443 
translation invariance, 256 
transpose convolution, 318 
tri-gram model, 379 

TTS, see text-to-speech 


U-net, 319 

undetermined multiplier, see Lagrange multiplier 
undirected graphical model, 327 

uniquenesses, 513 

universal approximation theorems, 182 
unobserved variable, see latent variable 
unsupervised learning, 4, 188 

utility function, 140 


VAE, see variational autoencoder 
valid convolution, 294 

validation set, 14 

vanishing gradient, 227, 382 
variance, 35, 36, 125 

variational autoencoder, 569 
variational inference, 485 
variational lower bound, see evidence lower bound 
vector quantization, 398, 465 
vertex, see node 

vision transformer, 395 

von Mises distribution, 89 

voxel, 289 


Wasserstein distance, 538 
Wasserstein GAN, 538 
wavelets, 114 

weakly supervised, 192 
weight decay, 13, 260 
weight parameter, 17, 180 


weight sharing, see parameter sharing 
weight vector, 132 

weight-space symmetry, 185 

WGAN, see Wasserstein GAN 
whitening, 502 

Woodbury identity, 610 

word embedding, 375 

word2vec, 375 

wrapped distribution, 94 


Yellowstone National Park, 86 